On Consistent and Calibrated Inference about 
the Parameters of Sampling Distributions 

Tomaz Podobnik 1),2),* and Tomi Zivko 2),†

1) Physics Department, University of Ljubljana, Slovenia
2) "Jozef Stefan" Institute, Ljubljana, Slovenia



Abstract 

The theory of probability, based on very general rules referred to as the Cox- 
Polya-Jaynes Desiderata, can be used both as a theory of random mass phenomena 
and as a quantitative theory of plausible inference about the parameters of sampling 
distributions. The existing applications of the Desiderata must be extended in order 
to allow for consistent inferences in the limit of complete a priori ignorance about 
the values of the parameters. Since the limits of consistent quantitative inference 
from incomplete information can clearly be established, the developed theory is nec- 
essarily an effective one. It is interesting to note that when applying the Desiderata 
strictly, we find no contradictions between the so-called Bayesian and frequentist 
schools of inductive reasoning.

As for prophecies, they will pass away; as for tongues, they will
cease; as for knowledge, it will pass away. For we know in part and 
we prophesy in part. 

1 Corinthians 13, 8-9. 

1 Introduction 

The term inference ([1], p. 436) stands for two kinds of reasoning ([2], p. xix): deductive
or demonstrative reasoning whenever enough information is at hand to permit it, and
inductive or plausible reasoning when not all of the necessary information is available. 
"The difference between the two kinds of reasoning is great and manifold. Demonstrative 
reasoning is safe, beyond controversy, and final. Plausible reasoning is hazardous, con- 
troversial, and provisional. Demonstrative reasoning penetrates the science just as far as 
mathematics does, but is in itself (as mathematics is in itself) incapable of yielding essen- 
tially new knowledge about the world around us. Anything new that we learn about the 
world involves plausible reasoning, which is the only kind of reasoning for which we care 
in everyday affairs. Demonstrative reasoning has rigid standards, codified and clarified 
by logic (formal or demonstrative logic^1), which is the theory of demonstrative reasoning.
The standards of plausible reasoning are fluid, and there is no theory of such reasoning
that could be compared to demonstrative logic in clarity or would command comparable
consensus." So writes George Polya in the Preface to his Mathematics and Plausible Reasoning
([3], p. v).

In the second volume [3] of the work he collects patterns of plausible reasoning and
dissects our intuitive common sense into a set of elementary qualitative desiderata that
represent basic rules of inductive reasoning. When formulating his views in mathematical
terms ([4], Chapter XV), he recognizes his rules to be in close agreement with the
calculus of probability as developed by Laplace in the late 18th century [5], but Polya
advances a thesis that when applying the calculus of probability to plausible reasoning, it
should be applied only qualitatively (see, for example, [4], pp. 136-139), i.e. numerical
values should be strictly avoided.

In the present paper we formulate a quantitative theory of inductive reasoning, in 
particular a consistent theory of quantitative inference about the parameters of sampling 
distributions. In Section 2 we adopt the basic qualitative rules of plausible reasoning, the
so-called Cox-Polya-Jaynes Desiderata, and review some of the well known results of 
their direct applications, such as Cox's and Bayes' Theorems. In addition, we clearly 
establish the lack of such applications in the limit of complete prior ignorance about 
the inferred parameter, with such ignorance representing the natural starting point of an 
inference. In other words, Bayes' Theorem that can be used for updating probabilities, 
cannot directly be used in the step of probability assignment. 

We carefully define the state of complete prior ignorance about the inferred parameter
in Section 3. In particular, throughout the present paper we never question the specified
model, i.e. the form of the sampling probability distribution is always known beyond the
required precision. What we are completely ignorant about at the beginning of reasoning
is the value of the inferred parameter.

^1 A branch of mathematics, also referred to as deductive logic.

In Section 4 we define location, dispersion and scale parameters and briefly review
some of the properties of sampling distributions determined by these parameters. A sep- 
arate section is devoted to the invariance of such distributions, a property that turns out 
to be of decisive importance when constructing a consistent theory of inductive reason- 
ing. We also show that invariance of both the form of the sampling distribution and its 
domain, under a continuous (or Lie) group, is found only in problems of inference about 
parameters that can be reduced to inference about location parameters. 

In Section 6 we extend the applications of the Desiderata in order to allow for a con-
sistent assignment and not just for updating of probability distributions for the inferred 
parameters. The so-called Consistency Theorem is obtained by making use of Bayes' 
Theorem and by requiring that if a conclusion can be reasoned out in more than one way, 
then every possible way must lead to the same result, in particular by requiring logically 
independent pieces of information to be commutative. The form of the Consistency The- 
orem is very similar to that of Bayes' Theorem, but there is also a fundamental difference 
between the two since in the former a consistency factor is used instead of the prior 
probability distribution. Hence, the consistency factor cannot be subject to any of the 
requirements such as normalization or invariance with respect to a one-to-one parameter 
transformation, that are perfectly legitimate for well defined probability distributions. 

Instead, the form of the consistency factor is determined in a way that preserves the 
logical consistency of our reasoning. By consistency we mean that, among other things, 
if in two problems of inference our state of knowledge is the same, then we must assign 
the same probabilities in both. In Section 7 we find that the basic Desiderata uniquely
determine the form of consistency factors for sampling distributions whose form and
domain are invariant under a Lie group $\mathcal{G}$ of transformations. It is therefore only
for those problems reducible to inference about location parameters that we can give an
assurance of consistency to our parameter inference. The form of consistency factors for
such distributions is then determined throughout Sections 8-10, while in Section 11 we
discuss under what circumstances the present theory is guaranteed to be consistent in the
case of pre-constrained parameters.

In Section 12 we make verifiable predictions that are based on the presented theory,
thus elevating its status above the level of a mere speculation. The predictions are made in 
terms of long run relative frequencies. We show that a consistent inference is necessarily 
also a calibrated one, i.e. that consistently predicted frequencies always coincide with (a 
one-to-one function of) actual frequencies of occurrence. This important result speaks 
in favour of a complete reconciliation between the so-called Bayesian and frequentist 
schools of plausible reasoning. 

In counting experiments the invariance of sampling distributions is clearly missing and
so the consistency factors cannot be uniquely determined by following the basic
Desiderata. The remedy is to collect enough data so that the sampling distribution approaches
its dense limit. Until then, our reasoning is necessarily based on some ad hoc prescriptions
that can (and very often will) lead to logically unacceptable results, as is demonstrated 
in Sectionim In such cases it might therefore be the best to refrain from quantitative 
inferences, i.e. to remain on a qualitative level. 

In Section 15 we briefly review conceptual and practical difficulties and paradoxes
caused by using Bayes' Theorem instead of the Consistency Theorem in the limit of 
complete ignorance about the inferred parameters. The problem is contained in the self- 
contradicting non-informative prior probability distributions. Long-lasting arguments 
over this subject led, inter alia, to a split in the theory of inductive reasoning and it is in 
this way that the Bayesian and the frequentist schools emerged. In our view, the splitting 
into (at first glance) almost diametrically opposed schools is highly artificial, provided 
that the two schools strictly obey their basic rules, i.e. that they refrain from using ad
hoc shortcuts in the course of inference. For regardless of how close to our intuitive
reasoning these ad hoc procedures may be, how well they may have performed in some other
previous inferences, and how respectable their names may sound (e.g. the principle of 
insufficient reason or its sophisticated version - the principle of maximum entropy, the 
principle of group invariance, the principle of maximum likelihood, and the principle of 
reduction), they will in general inevitably lead not only to contradictions between the two 
schools of thought, but also to inferences that are neither consistent nor calibrated. 

There are also two appendices to the present paper. The first one contains, for the 
sake of completeness, a proof of Cox's Theorem, while in the second one the so-called
marginalization paradox is extensively discussed.

2 Basic rules and their applications 

Let a hypothesis or an event be an unambiguous proposition $A$, i.e. a statement that
can be either true or false. As we are in general not certain about either of the two
possibilities, the classical logic of deductive reasoning [6] is to be extended in order to
allow for plausible or inductive inferences based on incomplete information.

Let (a state of) information $I$ summarize the information that we have about some
set $\mathcal{A}$ of propositions $A_i$, called the basis of $I$, and their relations to each
other. The domain of $I$ is the logical closure of $\mathcal{A}$, that is, the union of
$\mathcal{A}$. A state of information is not restricted to containing only deductive
information; it can also contain imprecise or insufficient information that says nothing with
certainty, but still affects one's opinion about a certain proposition. Such information
can also be updated: we write $I' = BI$ for a state of information obtained from $I$ by
adding additional information (evidence) that proposition $B$ is true.

Now, let $I$ be a state of information of a given person and $A$ be a proposition in the
domain of $I$. Then, we introduce the (degree of) plausibility $(A|I)$ as a degree of belief
of the person that $A$ is true given the information in $I$. We say that $I$ is the knowledge
base for the assigned plausibility $(A|I)$. In the present paper we assume all considered
plausibilities to be subject to very general requirements that can be listed in the following
three basic Cox-Polya-Jaynes Desiderata ([2], § 1.7, pp. 17-19):

I. Degrees of plausibility are represented by real numbers.

That is, plausibilities are numerically encoded states of knowledge about propositions. 
Formally, an assigned plausibility can be regarded as a function: 

$$(\,\cdot\,|\,\cdot\,) : \mathcal{A} \times X \longrightarrow \mathbb{R} ,$$

where $X$ is the set of possible states of information about some set $\mathcal{A}$ of propositions.

In addition to the first Desideratum, we adopt two natural but nonessential
conventions:

• a greater degree of belief shall correspond to a greater number; 

• the plausibility of a hypothesis that we are certain about (e.g. the plausibility of a 
tautology) equals 1 . 

By referring to the conventions as being nonessential we mean that we could have equally 
well adopted a convention that the plausibility of a tautology equals a different positive 
constant, or that a greater degree of probability should correspond to a smaller number. 
Nevertheless, according to the above Desideratum and the two conventions, the assigned 
plausibilities can range within an interval [F, 1], where F < 1 is plausibility of the false 
proposition. 

We say that a state of information $I$ is consistent if there is no proposition $A$ for which
plausibilities $(A|I)$ for $A$ being true and $(\overline{A}|I)$ for $A$ being false can both
equal unity. That is, based on consistent information both a proposition and its denial
cannot be true.
In order to avoid ambiguities, we restrict ourselves to considering only plausibilities that 
are assigned upon consistent states of information. 

II. Assignment of plausibilities must be in qualitative correspondence with common sense.
In our case, the concept of common sense stands for the following conditions:

• Since plausible reasoning is a generalization of deductive logic, it must be consistent
with the results of Boolean algebra [7], the algebra of deductive logic.

• Microscopic changes in the knowledge base should not cause macroscopic changes
in the plausibilities assigned. In addition, for every considered proposition $A$ there
exists some set of possible consistent states of knowledge $X$ such that $(A|I)$, with
$I \in X$, can take any of the values within a continuous interval $(a, b) \subset [F, 1]$
(continuity requirement).

• We assume that the degree of belief $(\overline{A}|I)$ that $A$ is false depends in some way
on the plausibility $(A|I)$ that $A$ is true. In addition, when old information $I$ is updated
into $I'$ in such a way that the plausibility of $A$ is increased, $(A|I') > (A|I)$, it
must produce a decrease in the plausibility that $A$ is false,
$(\overline{A}|I') < (\overline{A}|I)$. That is, we assume that there exists a continuous,
twice differentiable, strictly decreasing function $S$ of plausibility $(A|I)$, such that

$$(\overline{A}|I) = S[(A|I)] .$$

• Plausibility $(AB|I)$, assigned to a hypothesis $AB$ that two non-contradictory
hypotheses, $A$ and $B$, are simultaneously true, is assumed to be completely determined
by the values of $(A|I)$, $(B|AI)$, $(B|I)$ and $(A|BI)$. Then it can be shown
(see Lemma 1 in Appendix A) that evident inconsistencies are avoided only if $(AB|I)$
depends solely on $(A|I)$ and $(B|AI)$, i.e. if there exists a function $H$ such that

$$(AB|I) = H[(A|I), (B|AI)] .$$

We further require the function $H$ to be strictly increasing and twice differentiable
in both of its arguments. By strictly increasing we mean that if the knowledge
base $I$ is updated to $I'$ in such a way that the plausibility of $A$ is increased,
$(A|I') > (A|I)$, but the plausibility $(B|AI)$ remains the same, $(B|AI') = (B|AI)$,
this can only produce an increase in the plausibility that both $A$ and $B$ are true,
$(AB|I') \ge (AB|I)$, in which the equality can hold only if $B$ is impossible given
$A$ and $I$. Likewise, given information $I''$ such that $(A|I'') = (A|I)$ and
$(B|AI'') > (B|AI)$, we require that $(AB|I'') \ge (AB|I)$.

III. Assignment of plausibilities must be a consistent procedure:

a) If a conclusion can be reasoned out in more than one way, then every possible
way must lead to the same result.

b) When assigning plausibilities, we must always take into account all of the evidence
we have relevant to a hypothesis. We do not arbitrarily ignore some of
the information and base our conclusion only on what remains.

c) Equivalent states of knowledge must always be represented by equivalent plausibility
assignments. For example, if in two problems our state of knowledge is
the same (except perhaps for the labelling of the propositions), then we must
assign the same plausibilities in both.

The requirement of consistency plays a special role among various requirements which 
a theoretical system, or an axiomatic system, must satisfy. It can be regarded as the first 
of the requirements to be satisfied by every theoretical system, be it empirical or non- 
empirical. As for an empirical system, however, besides being consistent, it should satisfy 
a further criterion: it must be falsifiable ([8], § 24, pp. 91-92). According to this criterion,
statements, or systems of statements, convey information about the empirical world only
if they are capable of clashing with experience; or more precisely, only if they can be
systematically tested, that is to say, if they can be subjected (in accordance with a
methodological decision) to tests which might result in their refutation. In so far as a scientific
statement speaks about reality, it must be falsifiable: and in so far as it is not falsifiable,
it does not speak about reality [9]. Since every degree of belief is to be assigned on the
basis of available evidence, our aim is clearly to formulate an empirical theory of
plausible reasoning, i.e. a theory that speaks about reality. We therefore add an additional
requirement, an operational Desideratum, to the original Cox-Polya-Jaynes Desiderata:

IV. A theory of plausible inference must specify operations that ensure the falsifiability of 
every assigned degree of plausibility. 

Richard Cox showed [10] that the following can be deduced when plausibilities satisfy
Desiderata I.-III.b:

1. Suppose that plausibilities $(A|I)$, $(B|AI)$, $(B|I)$, $(A|BI)$ and $(AB|I)$ can be assigned.
Then there exists a continuous strictly increasing function $P$ of each of these
plausibilities,

$$P : [F, 1] \longrightarrow [0, 1] ,$$

such that

$$P(AB|I) = P(A|I)\,P(B|AI) = P(B|I)\,P(A|BI) \tag{1}$$

and

$$P(F) = 0 .$$

Every composition of the function $P$ and a plausibility assignment $(A|I)$, $P[(A|I)]$,
or $P(A|I)$ in a simplified notation, is referred to as the probability for $A$ to be
true given available information $I$, and the above equation (1) is referred to as the
product rule. Note that every probability is at the same time also a plausibility, i.e.
it is consistent with the basic Desiderata.

2. Probabilities $P(A|I)$ and $P(\overline{A}|I)$ sum up to the probability of a certain event,
i.e. sum up to unity:

$$P(A|I) + P(\overline{A}|I) = 1 , \tag{2}$$

which is referred to as the sum rule.

The above results are usually referred to as Cox's Theorem (for a proof of the Theorem 
see Appendix A). Note that the product and the sum rule, being only relations between 
probabilities, do not of themselves assign numerical values to any of the probabilities 
arising in a specific problem. The only numerical values, considered thus far, are those 
corresponding to certainty and impossibility, one and zero, respectively, of which the 
former is a mere consequence of a convention, adopted along with Desideratum I, rather
than required by the rules of the Theorem. Moreover, it is hardly to be supposed that
every reasonable expectation should have a precise numerical value [10], nor is there any
guarantee that every state of information about a particular proposition A will meet the 
continuity requirement of the common sense Desideratum. 

The product and the sum rule are unique in the sense that any set of rules for
manipulating our degrees of belief, represented by real numbers, is either isomorphic to (1) and
(2), i.e. different from (1) and (2) only in form but not in content, or inconsistent. Thus,
we could have chosen any other set of plausibilities $\tilde{P}$ that are one-to-one functions
of the corresponding probabilities $P$, and then adequately adapted the product and the sum
rule. For example, if we choose $\tilde{P}(A|I)$ such that

$$\tilde{P}(A|I) = \sqrt[a]{P(A|I)}$$

with $a$ being an arbitrary positive number, the corresponding product rule for $\tilde{P}$
remains the same while the sum rule reads:

$$\tilde{P}^{a}(A|I) + \tilde{P}^{a}(\overline{A}|I) = 1 .$$

As another example, we could have chosen zero to represent the plausibility of a
proposition that we are certain about, and a plausibility $\tilde{P}$ such that:

$$\tilde{P}(A|I) = \ln P(A|I) .$$

Then, the appropriate product and sum rules for $\tilde{P}$ would have read:

$$\tilde{P}(AB|I) = \tilde{P}(A|I) + \tilde{P}(B|AI) = \tilde{P}(B|I) + \tilde{P}(A|BI)$$

and

$$\exp\{\tilde{P}(A|I)\} + \exp\{\tilde{P}(\overline{A}|I)\} = 1 .$$
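
The two alternative plausibility scales above can be checked numerically. The following is a minimal sketch, not part of the original derivation: the probability values, the exponent `a = 2.5`, and the helper names `root_scale` and `log_scale` are assumptions chosen only for illustration.

```python
import math

# Hypothetical probabilities, chosen only for illustration.
p_a = 0.3                     # P(A|I)
p_b_given_a = 0.5             # P(B|AI)
p_ab = p_a * p_b_given_a      # product rule (1)
p_not_a = 1.0 - p_a           # sum rule (2)


def root_scale(p, a):
    """Alternative plausibility: the a-th root of the probability."""
    return p ** (1.0 / a)


def log_scale(p):
    """Alternative plausibility: the natural logarithm of the probability."""
    return math.log(p)


a = 2.5  # arbitrary positive number (assumption)

# Root scale: the product rule keeps its form, the sum rule holds for a-th powers.
root_product_ok = math.isclose(
    root_scale(p_ab, a), root_scale(p_a, a) * root_scale(p_b_given_a, a)
)
root_sum = root_scale(p_a, a) ** a + root_scale(p_not_a, a) ** a

# Log scale: products turn into sums and certainty is represented by zero.
log_product_ok = math.isclose(
    log_scale(p_ab), log_scale(p_a) + log_scale(p_b_given_a)
)
log_sum = math.exp(log_scale(p_a)) + math.exp(log_scale(p_not_a))
```

In both scales the content of the rules is unchanged; only their form differs, which is the sense in which the choice of probabilities resembles a choice of gauge.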

The freedom to choose an arbitrary plausibility function to represent our degree of 
belief is analogous to gauge invariance in field theories where potentials (i.e. functions 
that the fields are expressed by) are not rigidly fixed. The predictions of field theories 
are unchanged if the potentials are transformed according to specific rules, i.e. if the
potentials are subject to gauge transformations. Then we choose one particular form of
potential, i.e. we choose a particular gauge, not because it is more correct than any other,
but because it is more convenient for the particular problem that we are solving within a
field theory. For the same reason we choose probabilities and not any other plausibilities
to represent our degrees of belief: not because they are more correct, but because it is
for probabilities that the product and the sum rule take the simplest forms. We comment
on the choice of probability, that we adhere to throughout the present paper, again in
Section 12 when we discuss the relation between probability and frequency.

Once the probabilities are chosen from all possible plausibility functions, i.e. once the
gauge is fixed, the incompleteness of the concept of plausibility is removed: the product
and the sum rules, (1) and (2), are the fundamental equations of probability theory, while
all other equations for manipulating probabilities follow from their repeated applications.
For example, it is in this way that we obtain the general sum rule that either $A$ or $B$ is
true:

$$P(A + B|I) = P(A|I) + P(B|I) - P(AB|I) . \tag{3}$$

Suppose now that propositions $A_1, A_2, \ldots, A_n$ form an exhaustive set of mutually
exclusive propositions. The propositions are mutually exclusive if the evidence $I$ implies
that no two of them can be true simultaneously,

$$P(A_i A_j|I) = 0 \quad \text{for } i \neq j ,$$

and exhaustive if one of them must be true,

$$P\Big(\sum_{i=1}^{n} A_i \Big| I\Big) = \sum_{i=1}^{n} P(A_i|I) = 1 .$$

A classical textbook example of such a set would be six hypotheses Ai arising from toss- 
ing a die, where index i corresponds to the particular number thrown. 
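
The die example can be written out as a short sketch that checks exhaustiveness and the general sum rule (3) with exact arithmetic. The fair-die probabilities and the two composite propositions `A` and `B` are assumptions made only for illustration.

```python
from fractions import Fraction

# Six mutually exclusive, exhaustive hypotheses A_i: "the number thrown is i".
# Equal probabilities (a fair die) are an assumption made for illustration.
p = {i: Fraction(1, 6) for i in range(1, 7)}


def prob(event):
    """Probability of a proposition represented as a set of die outcomes."""
    return sum((p[i] for i in event), Fraction(0))


A = {2, 4, 6}   # hypothetical proposition "the number is even"
B = {4, 5, 6}   # hypothetical proposition "the number is greater than 3"

total = prob(p.keys())                    # exhaustive set: must equal 1
lhs = prob(A | B)                         # P(A + B | I)
rhs = prob(A) + prob(B) - prob(A & B)     # right-hand side of (3)
```

Representing propositions as sets of outcomes makes the logical sum a set union and the logical product a set intersection, so the check of (3) reduces to counting outcomes.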

The hypotheses of a set are unambiguously classified by assigning one or more
numerical indices. Deciding between the hypotheses $A_i$ and estimating the index $i$ are
practically the same thing:

$$P(A_i|I) = p(i|I) , \tag{4}$$

with the corresponding normalization

$$\sum_{i=1}^{n} p(i|I) = 1 . \tag{5}$$

We denote probabilities by capital $P$ when arguments are propositions, and by small $p$
when arguments are numerical values. By assigning probability to every possible value
of the index $i$ we specify how our degree of belief is distributed among the hypotheses of
the set $A_i$, i.e. we specify the (sampling) probability distribution for $i$.

The distribution of one's degree of belief in hypotheses labelled by different values
of index $i$ may be equivalently represented by the cumulative distribution function (cdf),
defined as

$$F(i) \equiv \sum_{j=i_a}^{i} p(j|I) = P(j \le i|I) , \tag{6}$$

where permissible values of the index $i$ range from $i_a$ to $i_b$. The probability for $i$
taking a value between $i_1$ and $i_2$ can thus be expressed by cdf's simply as:

$$P(i_1 < i \le i_2|I) = \sum_{i=i_1+1}^{i_2} p(i|I)
  = \sum_{i=i_a}^{i_2} p(i|I) - \sum_{i=i_a}^{i_1} p(i|I) = F(i_2) - F(i_1) .$$

In addition, for an exhaustive set of hypotheses the normalization condition (5) implies:

$$F(i_b) = \sum_{i=i_a}^{i_b} p(i|I) = 1 .$$
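
As a minimal sketch of the discrete cdf, the following builds $F(i)$ by accumulating $p(i|I)$ and evaluates an interval probability as a difference of cdf values. The uniform distribution over six indices and the helper name `interval_prob` are assumptions for illustration.

```python
from fractions import Fraction
from itertools import accumulate

indices = list(range(1, 7))               # i_a = 1, i_b = 6 (assumed range)
p = [Fraction(1, 6)] * 6                  # p(i|I), assumed uniform for illustration

# F(i) = sum of p(j|I) for j from i_a up to and including i
F = dict(zip(indices, accumulate(p)))


def interval_prob(i1, i2):
    """P(i1 < i <= i2 | I) expressed as a difference of cdf values."""
    return F[i2] - F[i1]
```

Because the cdf includes the upper index, the difference $F(i_2) - F(i_1)$ counts the indices strictly above $i_1$ and up to $i_2$.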

In many cases of practical importance the hypotheses of a set become very numerous
and dense. For example, when predicting the decay time of an unstable particle, we start
with a countable set of hypotheses $A_i$ that the decay time of that specific particle would
be, for instance, $i$ seconds. But such a set is not an exhaustive one since the decay time
could also be $i$ seconds plus some fraction of a second. Further refinement of the original
propositions leads to a dense set of hypotheses where neighbouring hypotheses, i.e.
hypotheses with nearly the same index value, become barely distinguishable. In such cases
there cannot be a sharply defined hypothesis that is strongly favoured over all others.
Instead, it only makes sense to consider probabilities for index $i$ in a certain interval
of its permissible range. In this way index $i$ transforms into a continuous variable $x$
with a continuous sampling probability distribution:

$$p(x_1|I) = P(x \in (x_1, x_1 + dx)|I) . \tag{7}$$
Every continuous distribution can be expressed by a probability density function (pdf),
$f(x|I)$:

$$p(x|I) = f(x|I)\,dx . \tag{8}$$

Using pdf's we can now rewrite the product rule (1) as

$$f(x_1 x_2|I) = f(x_1|I)\,f(x_2|x_1 I) = f(x_2|I)\,f(x_1|x_2 I) , \tag{9}$$

and the normalization (5) by replacing summation over discrete indices $i$ by integration
over a dense domain $X$:

$$\int_X f(x'|I)\,dx' = 1 . \tag{10}$$

The sum in the cdf (6) for a discrete variable $i$ is replaced by an integral for a
continuous variable $x$:

$$F(x) = \int_{x_a}^{x} f(x'|I)\,dx' = P(x' \le x|I) , \tag{11}$$

where $x$ ranges from $x_a$ to $x_b$. Since the probability $P(x_1 < x \le x_2|I)$ can be
expressed by the cdf's as:

$$P(x_1 < x \le x_2|I) = \int_{x_1}^{x_2} f(x|I)\,dx = F(x_2) - F(x_1) ,$$

the normalization (10) implies:

$$F(x_b) = \int_{x_a}^{x_b} f(x'|I)\,dx' = 1 .$$

Suppose we have a continuous variable $y$ that is related functionally to a variable $x$ by
a one-to-one relation:

$$x = x(y) \; ; \quad y = y(x) ,$$

$y$ being differentiable in $x$ and vice versa. Let $y_1 = y(x_1)$ and $y_2 = y(x_2)$. Since the
inferences about $x$ and $y$ are based on equivalent pieces of information $I$ and $I'$ (the
transformations of the variables correspond only to relabellings of hypotheses),
Desideratum III.c implies the following equality:

$$P(x \in (x_1, x_2)|I) =
  \begin{cases}
    P(y \in (y_1, y_2)|I') , & y_1 < y_2 , \\
    P(y \in (y_2, y_1)|I') , & y_1 > y_2 ,
  \end{cases} \tag{12}$$

where

$$\begin{aligned}
  P(x \in (x_1, x_2)|I) &= \int_{x_1}^{x_2} f(x|I)\,dx , \\
  P(y \in (y_1, y_2)|I') &= \int_{y_1}^{y_2} f(y|I')\,dy , \\
  P(y \in (y_2, y_1)|I') &= \int_{y_2}^{y_1} f(y|I')\,dy ,
\end{aligned} \tag{13}$$

with $f(x|I)$ and $f(y|I')$ being the pdf's for $x$ and $y$, respectively. That is, assigned
probabilities must be invariant under variate transformations.

As an example that illustrates the above reasoning, imagine two scientists, say Mr. A and
Mr. B, measuring decay times of unstable particles. Each time they start their clocks at
the moment when a particle is produced, but the clocks run at different speeds, so Mr. A
measures a decay time $t_i$ of the $i$-th particle, and Mr. B

$$t'_i = a\,t_i , \tag{14}$$

where $a$ is an arbitrary positive constant.

Since $t$ is a continuous random variable, there is not much point in considering
probabilities $P(t = t_i|I)$ and $P(t' = t'_i|I')$, since both $t = t_i$ and $t' = t'_i$ are
events with zero probability (probability measure). Instead, we should consider probabilities
for measuring the decay time in certain intervals $(t_i, t_i + dt)$ and $(t'_i, t'_i + dt')$,
where $dt$ and $dt'$ are the widths of the two intervals. Due to the different speeds of the
two clocks, equivalent events, i.e. equivalent time intervals, are not labelled equally by the
two observers: the interval $(t_i, t_i + dt)$ of Mr. A corresponds to the interval
$(t'_i, t'_i + dt') = (a t_i, a t_i + a\,dt)$ of Mr. B. For example, in the case of $a = 5$,
the interval no. 10 of Mr. A is split into intervals 46-50 by Mr. B. That is, the variate
transformation (14) implies

$$dt' = a\,dt .$$

Then, since the two propositions, $t \in (t_i, t_i + dt)$ and $t' \in (t'_i, t'_i + dt')$,
differ only in labelling, Desideratum III.c implies that the two probabilities
$P(t \in (t_i, t_i + dt)|I)$ and $P(t' \in (t'_i, t'_i + dt')|I')$ are equal.

Note that the logic behind such reasoning is very similar to the logic of Poincare's
relativity principle [11] of the special theory of relativity stating that no preferred inertial
frame (or no absolute time scale) exists.
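
The two-clock example can be made concrete with a small numerical sketch. The exponential decay law, the lifetime `tau`, and the function names are assumptions introduced only for illustration; the text above does not specify the sampling distribution.

```python
import math

tau = 2.0   # hypothetical mean lifetime in Mr. A's time units
a = 5.0     # Mr. B's readings are a times Mr. A's: t' = a t


def cdf_A(t):
    """Cdf of the decay time on Mr. A's clock (assumed exponential law)."""
    return 1.0 - math.exp(-t / tau)


def cdf_B(t_prime):
    """The same decay law expressed in Mr. B's time units."""
    return 1.0 - math.exp(-t_prime / (a * tau))


def pdf_A(t):
    return math.exp(-t / tau) / tau


def pdf_B(t_prime):
    return math.exp(-t_prime / (a * tau)) / (a * tau)


t1, t2 = 9.0, 10.0                           # Mr. A's interval no. 10
prob_A = cdf_A(t2) - cdf_A(t1)               # P(t in (t1, t2) | I)
prob_B = cdf_B(a * t2) - cdf_B(a * t1)       # P(t' in (a t1, a t2) | I')
```

The probabilities of the corresponding intervals coincide, while the densities at corresponding points differ by the factor $a$, as expected from $dt' = a\,dt$.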

The variate transformation (14) is linear, but it need not be so, as long as it remains
one-to-one. Suppose that Mr. B considers a probability distribution of a variate

$$y = \ln t . \tag{15}$$

If Mr. A divides a range $(0, 1]$ of his variate $t$ into $n$ intervals of equal widths

$$dt = \frac{1}{n} ,$$

Mr. B's $n$ corresponding intervals $dy$ of highly non-uniform widths

$$dy = \frac{1}{t}\,dt = e^{-y}\,dt$$

cover the infinite corresponding range $(-\infty, 0]$ of $y$. Despite the non-uniformity of the
interval widths, the transformation (15) still represents a mere relabelling of the
hypotheses: $t$ and $y$ are equivalent variates, and $t \in (t_i, t_i + dt)$ and
$y \in (y_i, y_i + dy)$ (with $y_i = \ln t_i$ and $dy = e^{-y_i}\,dt$) are equivalent
propositions. Imagine that because of using the transformed variate $y$ instead of $t$,
Mr. B considers himself inferior to Mr. A. But since the transformation (15) is one-to-one,
the inverse transformation

$$t = e^{y}$$

always exists: Mr. B can always obtain $t_i$ from $y_i$ and then make his inference from $t_i$
instead of from $y_i$.
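
A short sketch of the logarithmic relabelling follows. The number of intervals `n = 4` and the uniform pdf for $t$ on $(0, 1]$ are assumptions used only to make the probability content explicit; with that pdf, the pdf for $y$ is $e^{y}$ on $(-\infty, 0]$.

```python
import math

n = 4                                              # number of Mr. A's intervals (assumed)
edges_t = [k / n for k in range(n + 1)]            # 0, 1/4, 1/2, 3/4, 1
edges_y = [math.log(t) if t > 0 else -math.inf for t in edges_t]

# Widths of Mr. B's intervals: highly non-uniform, the first one is infinite.
widths_y = [hi - lo for lo, hi in zip(edges_y, edges_y[1:])]

# Assumed uniform pdf for t: f(t|I) = 1 on (0, 1], hence f(y|I') = e^y.
# The probability of each y-interval is the integral of e^y over it.
prob_y = [math.exp(hi) - math.exp(lo) for lo, hi in zip(edges_y, edges_y[1:])]
```

Each of Mr. B's intervals carries the same probability $1/n$ as the corresponding interval of Mr. A, although the interval widths differ greatly.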

To sum up, the logic behind Desideratum III.c implies equivalence of all variates
connected via one-to-one transformations: while the specified models (i.e. forms of pdf's, see
below) together with the range of the variates may be changed (this is indicated by using
symbol $I'$ instead of $I$), the probability content must be invariant under such
transformations.

The equality (12) is assured for all intervals $(x_1, x_2)$ and $(y(x_1), y(x_2))$ only if
([12], pp. 20-28)

$$f(y|I') = f(x|I) \left| \frac{dx}{dy} \right| . \tag{16}$$



Indeed:

$$P(y \in (y_1, y_2)|I') = \int_{y_1}^{y_2} f(y|I')\,dy
  = \int_{y_1}^{y_2} f(x|I) \left| \frac{dx}{dy} \right| dy
  = \int_{x_1}^{x_2} f(x|I)\,dx = P(x \in (x_1, x_2)|I)$$

for variate transformations with positive $dx/dy$, and

$$P(y \in (y_2, y_1)|I') = \int_{y_2}^{y_1} f(y|I')\,dy
  = \int_{y_2}^{y_1} f(x|I) \left| \frac{dx}{dy} \right| dy
  = -\int_{x_2}^{x_1} f(x|I)\,dx = P(x \in (x_1, x_2)|I)$$

in the case of negative $dx/dy$. Inversely, $f(x|I)$ is expressed in terms of $f(y|I')$ as

$$f(x|I) = f(y|I') \left| \frac{dy}{dx} \right| , \tag{17}$$

where $|dy/dx|$ is the reciprocal of $|dx/dy|$,

$$\left| \frac{dy}{dx} \right| = \left| \frac{dx}{dy} \right|^{-1} ,$$

since any one-to-one transformation from $x$ to $y$ and then from $y$ to $x$ must restore the
original distribution, and hence

$$\left| \frac{dx}{dy} \right| \left| \frac{dy}{dx} \right| = 1 .$$

Note that equal information implies equal probabilities, while in general it does not imply 
equality of pdf's. 
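
The transformation rule for pdf's can be checked numerically for a decreasing transformation. In this sketch the pdf $f(x|I) = 2x$ on $(0, 1)$, the transformation $y = -\ln x$, and the midpoint-rule helper `integrate` are all assumptions chosen for illustration.

```python
import math


def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f from lo to hi."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


def f_x(x):
    """Assumed pdf for x on (0, 1)."""
    return 2.0 * x


def x_of_y(y):
    """Inverse of the assumed transformation y = -ln x."""
    return math.exp(-y)


def dx_dy(y):
    """Derivative dx/dy, negative everywhere for this transformation."""
    return -math.exp(-y)


def f_y(y):
    """Pdf for y: the pdf for x times the absolute value of dx/dy."""
    return f_x(x_of_y(y)) * abs(dx_dy(y))


x1, x2 = 0.2, 0.5
y1, y2 = -math.log(x1), -math.log(x2)     # y1 > y2 since y decreases with x

prob_x = integrate(f_x, x1, x2)           # P(x in (x1, x2) | I)
prob_y = integrate(f_y, y2, y1)           # P(y in (y2, y1) | I')
```

The two probabilities agree, while the values of the two pdf's at corresponding points generally do not.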

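In the bivariate case where

$$\begin{aligned}
  y^{(1)} &= y^{(1)}(x^{(1)}, x^{(2)}) , & y^{(2)} &= y^{(2)}(x^{(1)}, x^{(2)}) , \\
  x^{(1)} &= x^{(1)}(y^{(1)}, y^{(2)}) , & x^{(2)} &= x^{(2)}(y^{(1)}, y^{(2)}) ,
\end{aligned}$$

the relations between the pdf's for $(x^{(1)}, x^{(2)})$ and $(y^{(1)}, y^{(2)})$,
$f(x^{(1)}, x^{(2)}|I)$ and $f(y^{(1)}, y^{(2)}|I')$, are the following:

$$f(y^{(1)}, y^{(2)}|I') = f(x^{(1)}, x^{(2)}|I)\,|J| \tag{18}$$

and

$$f(x^{(1)}, x^{(2)}|I) = f(y^{(1)}, y^{(2)}|I')\,|J^{*}| , \tag{19}$$

with the absolute values of the derivatives, $|dx/dy|$ and $|dy/dx|$, being replaced by the
absolute values of the corresponding Jacobians,

$$J = \frac{\partial(x^{(1)}, x^{(2)})}{\partial(y^{(1)}, y^{(2)})} =
  \begin{vmatrix}
    \partial_{y^{(1)}} x^{(1)} & \partial_{y^{(2)}} x^{(1)} \\
    \partial_{y^{(1)}} x^{(2)} & \partial_{y^{(2)}} x^{(2)}
  \end{vmatrix}$$

and

$$J^{*} = J^{-1} ,$$

respectively.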
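
As an illustrative sketch of the bivariate rule, the following estimates the Jacobian of a transformation by central differences and applies it to a pdf. The choice of polar coordinates $(r, \varphi)$ as the new variates, the standard bivariate normal pdf, and all function names are assumptions not taken from the text.

```python
import math


def x_of_y(r, phi):
    """Assumed transformation: Cartesian (x1, x2) from polar (r, phi)."""
    return r * math.cos(phi), r * math.sin(phi)


def jacobian(r, phi, h=1e-6):
    """Central-difference estimate of J = d(x1, x2) / d(r, phi)."""
    x1_rp, x2_rp = x_of_y(r + h, phi)
    x1_rm, x2_rm = x_of_y(r - h, phi)
    x1_pp, x2_pp = x_of_y(r, phi + h)
    x1_pm, x2_pm = x_of_y(r, phi - h)
    dx1_dr = (x1_rp - x1_rm) / (2 * h)
    dx2_dr = (x2_rp - x2_rm) / (2 * h)
    dx1_dphi = (x1_pp - x1_pm) / (2 * h)
    dx2_dphi = (x2_pp - x2_pm) / (2 * h)
    return dx1_dr * dx2_dphi - dx1_dphi * dx2_dr


def f_x(x1, x2):
    """Assumed pdf for (x1, x2): standard bivariate normal."""
    return math.exp(-(x1 * x1 + x2 * x2) / 2.0) / (2.0 * math.pi)


def f_y(r, phi):
    """Pdf for (r, phi): the pdf for (x1, x2) times |J|."""
    return f_x(*x_of_y(r, phi)) * abs(jacobian(r, phi))


r0, phi0 = 2.0, 0.7
J = jacobian(r0, phi0)      # analytically equal to r for polar coordinates
J_star = 1.0 / J            # Jacobian of the inverse transformation
```

For this transformation the resulting pdf for $(r, \varphi)$ is $r\,e^{-r^2/2}/(2\pi)$, which the numerical estimate reproduces.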

Special attention is needed if the derivatives in a univariate case, or the Jacobians in a
multivariate case, change sign within the domain of the pdf since in that case the variate
transformations $x \to y$ and $y \to x$ are not one-to-one any more. We will meet such a
difficulty in Section 4 where it will be overcome on account of the special symmetry of
the specific transformation.

In the present paper we consider sampling probability distributions, either discrete
$p(i|\theta I)$ or continuous $p(x|\theta I) = f(x|\theta I)\,dx$, that can be specified by a
mathematical function, determined by the values of its parameters $\theta$.^2 In such cases
assignment of probabilities to hypotheses from a given set reduces to estimation of the
parameters $\theta$ of the distribution: what we try to achieve is to assign probabilities to
different values of the parameters, i.e. to specify the probability distribution for $\theta$.
Note that the probability distributions for parameters are subject to the same Desiderata as
the distributions for sampling variates. In this paper we will focus on the parameters with
dense domains, where the corresponding probability distributions are continuous:

$$p(\theta|x_1 I) = f(\theta|x_1 I)\,d\theta . \tag{20}$$

^2 We adhere to the common and useful convention of using Greek letters $\theta$, $\mu$,
$\sigma$, $\tau$, $\nu$ and $\lambda$ for parameters throughout the paper.

The probability for $\theta$ (20) is assigned upon information that we explicitly split in our
notation into evidence $x_1$ from the measurement of the quantity $x$ whose probability
distribution is determined by $\theta$, and the additional relevant information $I$. The reason
for such splitting of the information will become evident below when we derive Bayes'
Theorem.

The pdf for $\theta$ (20) is subject to the usual normalization,

$$ \int f(\theta'|x_1 I)\,d\theta' = 1 , \qquad (21) $$

where integration is performed over the complete range of $\theta$. In addition, in the case of
assigning probabilities for two parameters, $\theta$ and $\nu$, simultaneously, the product rule
can be applied:

$$ f(\theta\nu|x_1 I) = f(\theta|x_1 I)\, f(\nu|\theta x_1 I) = f(\nu|x_1 I)\, f(\theta|\nu x_1 I) . \qquad (22) $$

Then, with the factors $f(\nu|\theta x_1 I)$ and $f(\theta|\nu x_1 I)$ being properly normalized according to
(21), it is easy to see that the marginalization procedure yields

$$ \int f(\theta'\nu|x_1 I)\,d\theta' = f(\nu|x_1 I) , \qquad \int f(\theta\nu'|x_1 I)\,d\nu' = f(\theta|x_1 I) . \qquad (23) $$



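The product rule can also be applied for assigning probabilities to $\theta$ and $x_1$:

$$ p(\theta x_1|I) = f(\theta|I)\,d\theta\; p(x_1|\theta I) = p(x_1|I)\, f(\theta|x_1 I)\,d\theta . \qquad (24) $$

The above equation can be rewritten into Bayes' Theorem [13, 14],

$$ f(\theta|x_1 I) = \frac{f(\theta|I)\, p(x_1|\theta I)}{p(x_1|I)} , \qquad (25) $$

also referred to as the principle of inverse probability (see [15], § 1.22, p. 28). When the
domain of $x$ is also dense, the theorem can be written in terms of pdf's only:

$$ f(\theta|x_1 I) = \frac{f(\theta|I)\, f(x_1|\theta I)}{f(x_1|I)} . \qquad (26) $$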

We interpret the theorem in the following way ([16], § 1.3, p. 2). We are interested in
the probability distribution for $\theta$ and begin with the initial or prior probability, also
referred to as the probability a priori, whose pdf reads $f(\theta|I)$. It is based on any additional
information $I$ that we possess beyond the immediate data $x_1$. Thus, $p(\theta|I) = f(\theta|I)\,d\theta$
is the probability for $\theta$ prior to taking evidence $x_1$ into account. The posterior probability,
also referred to as the probability a posteriori, $f(\theta|x_1 I)\,d\theta$, is the probability for $\theta$
posterior to adding evidence $x_1$ to our previous information. The likelihood, $p(x_1|\theta I)$,
tells us how likely it is that $x_1$ is observed, given the value $\theta$ of the parameter that determines the
probability distribution for $x$. According to Bayes' Theorem (25), the only consistent
way of obtaining the posterior pdf, $f(\theta|x_1 I)$, is by multiplying the prior pdf by the likelihood,
which is usually referred to as the likelihood principle (see, for example, [2], § 8.5,
p. 250). Or, in the words of Jeffreys ([15], § 2.0, p. 57): "Consequently the whole of the
information contained in the observations that is relevant to the posterior probabilities of
different hypotheses(*), is summed up in the values that they give to the likelihood." Note
that with the basic Desiderata being adopted, the term "likelihood principle" becomes
inappropriate and might even be misleading, since the fact that all of the information that
can be extracted from the datum $x_1$ is contained in the value of the appropriate likelihood
is a mere consequence of the application of the basic Desiderata, rather than an additional
principle (i.e. a Desideratum) on its own. The denominators $p(x_1|I)$ and $f(x_1|I)$ can be
obtained from the normalization requirement (21) as:

$$ p(x_1|I) = \int f(\theta'|I)\, p(x_1|\theta' I)\,d\theta' \qquad (27) $$

and

$$ f(x_1|I) = \int f(\theta'|I)\, f(x_1|\theta' I)\,d\theta' . \qquad (28) $$



Bayes' Theorem is thus a rule for updating the information that an inference is based
upon. Formally, it is just a special case of the product rule of the Cox theorem. The
latter also ensures that (25) is the only consistent way of updating the information and,
consequently, our probability distribution for $\theta$.

Suppose that after $x_1$ we learn a new piece of information, $x_2$, that we would like to
include in our inference about $\theta$. Then, $f(\theta|x_1 I)$ serves as a prior pdf, i.e. the pdf for $\theta$ prior
to taking $x_2$ into account. According to (25), the posterior pdf then reads:

$$ f(\theta|x_1 x_2 I) = \frac{f(\theta|x_1 I)\, p(x_2|\theta x_1 I)}{p(x_2|x_1 I)} = \frac{f(\theta|I)\, p(x_1|\theta I)\, p(x_2|\theta x_1 I)}{p(x_1|I)\, p(x_2|x_1 I)} , \qquad (29) $$

where $p(x_2|\theta x_1 I)$ is the likelihood for $x_2$ given $\theta$, $x_1$ and the additional information $I$.
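The sequential updating in (29) can be illustrated numerically. The following is a minimal sketch, not part of the original derivation: it assumes an exponential sampling model, independent data, a parameter grid, and an arbitrary (hypothetical) prior, and checks that updating with $x_1$ and then $x_2$ gives the same posterior as updating in the reverse order or with both data at once. All function and variable names are illustrative.

```python
import math

def normalize(weights, step):
    """Rescale grid values so that they integrate to one (rectangle rule)."""
    total = sum(weights) * step
    return [w / total for w in weights]

def exp_likelihood(t, tau):
    """Density of an exponential decay time t, given the mean decay time tau."""
    return math.exp(-t / tau) / tau

def update(prior, grid, step, datum):
    """One application of Bayes' Theorem (25) on a grid of parameter values."""
    unnorm = [p * exp_likelihood(datum, tau) for p, tau in zip(prior, grid)]
    return normalize(unnorm, step)

step = 0.01
grid = [step * (i + 0.5) for i in range(2000)]                   # tau in (0, 20)
prior = normalize([math.exp(-tau / 5.0) for tau in grid], step)  # hypothetical prior

t1, t2 = 1.3, 2.1                                                # hypothetical data
post_12 = update(update(prior, grid, step, t1), grid, step, t2)
post_21 = update(update(prior, grid, step, t2), grid, step, t1)
post_joint = normalize(
    [p * exp_likelihood(t1, tau) * exp_likelihood(t2, tau)
     for p, tau in zip(prior, grid)],
    step,
)
max_diff = max(abs(a - b) for a, b in zip(post_12, post_joint))
```

Here `max_diff` measures the largest pointwise difference between the two-step and the one-step posterior; within floating-point accuracy it should vanish.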

In the limit of our complete ignorance about the value of a parameter $\theta$ prior to the first
evidence $x_1$, when $I$ merely stands for our admission that we possess no prior information
relevant to $\theta$ apart from the specified form of the sampling distribution, the complete
procedure for manipulating probabilities by using the product rule, or Bayes' Theorem
(25), breaks down. This is a direct consequence of the fact that we can only assign
probabilities for hypotheses on the basis of available relevant information: ignorance $I$
thus allows for no probability assignment $f(\theta|I)$. In this event, both the product rule
and its derivative, Bayes' Theorem (25), lack their vital components and cannot be used.
In other words, Bayes' Theorem only allows for updating probabilities that were already
assigned prior to their updating, and therefore needs to be amended for the limit of complete
prior ignorance, which is a natural starting point for every sequential updating of information.
Our goal in the following sections is to make such an amendment in a consistent
way, i.e. to establish when and how probabilities can consistently be assigned.

(*) i.e. of different values of the inferred parameter(s)

3 Complete ignorance about parameters 

As a starting point, we would like to specify precisely what we mean and what we do
not mean by the limit of complete ignorance about the value of a parameter of a given
probability distribution for $x$. First of all, ignorance about a distribution parameter is not
a synonym for absolute ignorance in every possible respect. For example, throughout the
paper we assume as a working hypothesis that the probability for $x$ is distributed in a form
that is completely known but for the value of its parameter(s) $\theta$, i.e. the chosen form of
the probability distribution for $x$, together with its domain (the ranges of the sampling
variate(s) and the inferred parameter(s), $(x_a, x_b)$ and $(\theta_a, \theta_b)$), is always assumed to be
appropriate beyond the required precision. This assumption is explicitly indicated by the
symbol $I$ that every probability (or probability density) is conditioned upon.

What we are completely ignorant about is the value of $\theta$. There is no information at
our disposal that would enable us to assign a probability distribution for $\theta$: it can take any
value within its permissible range $(\theta_a, \theta_b)$. Since the value of $\theta$ is completely unknown,
the distribution for $x$ becomes undetermined. This is where we start collecting
data that we would like to use for a consistent inference about $\theta$.

The situation described above is an ideal limiting case that can serve as a reasonable
approximation for many real-life situations. For example, even before the first measurement
of a decay time of an unknown unstable particle, there is not much room for doubt
about the form of the decay time distribution. Due to past experience with all other unstable
particles we feel almost completely certain that the distribution would be exponential
(we come back to this point in Section 16, where the possibility of assigning probabilities
to specified models is considered).

But before the first measurement, the parameter of the distribution, the average decay
time $\tau$, is completely unknown, so we do not know what value of the first measured decay
time, $t_1$, to expect: it could be anything between zero and infinity. Inversely, before the
first measurement of $t$, the parameter $\tau$ can take any value in the same interval. Note that
a hint of a symmetry between the collected data and the inferred parameters is present in
the foregoing reasoning. The concept will be extensively exploited in the following
sections.

With $I$ representing only knowledge about the type of sampling distribution, different
states of knowledge $I_x = xI$ and $I_\theta = \theta I$ can be enumerated according to the values of $x$
and $\theta$, respectively. In this way, the sets $\mathcal{I}_x$ and $\mathcal{I}_\theta$ of possible different states of knowledge
become subsets of real numbers, $\mathcal{I}_x, \mathcal{I}_\theta \subset \mathbb{R}$, and the pdf's $f(x|\theta I)$ and $f(\theta|xI)$ can both
be formally expressed as functions

$$ f(x|\theta I),\; f(\theta|xI) : \mathbb{R} \times \mathbb{R} \to \mathbb{R} . $$

Many of the derivations in the present article automatically require dense ranges for
both $x$ and $\theta$, as well as differentiability of the pdf's $f(x|\theta I)$ and $f(\theta|x_1 I)$. The problems
accompanying discrete sets of hypotheses are extensively discussed in a later section, where
inferences about parameters of counting experiments are considered.

4 Location, scale and dispersion parameters 

We pay special attention to the so-called location, scale, and dispersion parameters of
probability distributions. A parameter $\mu$ of a sampling distribution is a location parameter,
and a parameter $\sigma$ is a dispersion parameter, if the pdf for $x$ takes the form

$$ f(x|\mu\sigma I) = \frac{1}{\sigma}\,\varphi\!\left(\frac{x-\mu}{\sigma}\right) , \qquad (30) $$

with the range for $x$ stretching over the whole real axis, with the range of $\mu$ being an
interval $(\mu_a, \mu_b)$ on the real axis and with the permissible range of $\sigma$ being an interval
$(\sigma_a, \sigma_b)$ on the positive half of the real axis. For the time being, let the permissible range
of $\mu$ coincide with the entire real axis, $(\mu_a, \mu_b) = (-\infty, \infty)$, and the range of $\sigma$ with
its entire positive half, $(\sigma_a, \sigma_b) = (0, \infty)$, while we postpone a discussion of
pre-constrained parameters until a later section.

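A bivariate pdf for two independent variates, $x^{(1)}$ and $x^{(2)}$, both being subject to the
same pdf of the form (30), according to the product rule equals the product of the univariate
pdf's, $f(x^{(1)}|\mu\sigma I)$ and $f(x^{(2)}|\mu\sigma I)$:

$$ f(x^{(1)} x^{(2)}|\mu\sigma I) = f(x^{(1)}|\mu\sigma I)\, f(x^{(2)}|x^{(1)}\mu\sigma I) = f(x^{(1)}|\mu\sigma I)\, f(x^{(2)}|\mu\sigma I) = \frac{1}{\sigma^2}\,\varphi\!\left(\frac{x^{(1)}-\mu}{\sigma}\right) \varphi\!\left(\frac{x^{(2)}-\mu}{\sigma}\right) . \qquad (31) $$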

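The pdf for the transformed variates of $x^{(1)}$ and $x^{(2)}$, $\bar{x}$ and $s$, where

$$ \bar{x} = \frac{x^{(1)} + x^{(2)}}{2} \qquad \text{and} \qquad s = \begin{cases} (x^{(1)} - x^{(2)})/2 \; ; & x^{(1)} > x^{(2)} \\ (x^{(2)} - x^{(1)})/2 \; ; & x^{(2)} > x^{(1)} \end{cases} $$

or, inversely,

$$ x^{(1,2)} = \begin{cases} \bar{x} \pm s \; ; & x^{(1)} > x^{(2)} \\ \bar{x} \mp s \; ; & x^{(2)} > x^{(1)} \end{cases} $$

can be calculated according to (18):

$$ f(\bar{x} s|\mu\sigma I) = 4\, f\big(x^{(1)}(\bar{x}, s)\, x^{(2)}(\bar{x}, s)|\mu\sigma I\big) = \frac{4}{\sigma^2}\,\varphi\!\left(\frac{\bar{x}+s-\mu}{\sigma}\right) \varphi\!\left(\frac{\bar{x}-s-\mu}{\sigma}\right) . \qquad (32) $$

The factor 4 in (32), which ensures appropriate normalization, arises as a product of the
absolute value of the Jacobian,

$$ \frac{\partial(x^{(1)}, x^{(2)})}{\partial(\bar{x}, s)} = \begin{cases} -2 \; ; & x^{(1)} > x^{(2)} \\ +2 \; ; & x^{(2)} > x^{(1)} \end{cases} \qquad (33) $$

and an additional factor of 2, the latter being a consequence of the symmetry of the pdf
for $x^{(1)}$ and $x^{(2)}$ with respect to the change of sign of the difference $x^{(1)} - x^{(2)}$. With the
sign of the difference inverted, the determinant (33) inverts its sign, too, but its absolute
value, as well as the value of the pdf $f\big(x^{(1)}(\bar{x}, s)\, x^{(2)}(\bar{x}, s)|\mu\sigma I\big)$ (see (32)), remain
unchanged. Therefore both the range with the positive and the range with the negative sign
of the Jacobian can be simultaneously taken into account simply by multiplying the pdf
$f\big(x^{(1)}(\bar{x}, s)\, x^{(2)}(\bar{x}, s)|\mu\sigma I\big)$ by an additional factor of 2.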
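The role of the factor 4 can be checked numerically. The sketch below is an illustration only: it assumes that $\varphi$ is the standard normal density (one possible choice; the text leaves $\varphi$ general), takes $\bar{x}$ as the mean and $s$ as the half-range of the two variates, and integrates the pdf (32) over $\bar{x}$ on the real axis and $s$ on its positive half with a simple midpoint rule. Function names and the numerical settings are hypothetical.

```python
import math

def phi(u):
    """Standard normal density, used here as one possible choice of phi."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def pdf_mean_halfrange(xbar, s, mu, sigma):
    """Eq. (32): pdf for xbar = (x1 + x2)/2 and s = |x1 - x2|/2."""
    return (4.0 / sigma**2
            * phi((xbar + s - mu) / sigma)
            * phi((xbar - s - mu) / sigma))

def integrate(mu, sigma, half_width=10.0, n=400):
    """Midpoint rule over xbar in (mu - L, mu + L) and s in (0, L), L = half_width * sigma."""
    L = half_width * sigma
    dx = 2.0 * L / n
    ds = L / n
    total = 0.0
    for i in range(n):
        xbar = mu - L + (i + 0.5) * dx
        for j in range(n):
            s = (j + 0.5) * ds
            total += pdf_mean_halfrange(xbar, s, mu, sigma)
    return total * dx * ds

norm = integrate(mu=1.5, sigma=0.7)   # hypothetical parameter values
```

With the factor 4 in place the integral `norm` comes out as unity to numerical accuracy; with the Jacobian factor 2 alone it would be one half.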

When inferring the parameters of a sampling distribution of the form (30) it may
happen that the value of one of the two parameters is known to a high precision. In such
cases the parameter with the precisely determined value is fixed and we only make an
inference about the remaining one. Let first the dispersion parameter $\sigma$ be fixed to $\sigma_0$, e.g.
to 1, so that the pdf for $x$, given the possible value of $\mu$ and the fixed value $\sigma_0$,

$$ f(x|\mu\sigma_0 I) = \frac{1}{\sigma_0}\,\varphi\!\left(\frac{x-\mu}{\sigma_0}\right) \equiv \phi(x-\mu) , \qquad (34) $$

is a function of $x$ and $\mu$ only. The fixed parameter ($\sigma_0$ in the present case) is usually
(though not always) omitted from explicit expressions.
According to (34), the cdf for $x$ reads:

$$ F(x,\mu) = \int_{-\infty}^{x} f(x'|\mu\sigma_0 I)\,dx' = \int_{-\infty}^{x} \phi(x'-\mu)\,dx' = \int_{-\infty}^{x-\mu} \phi(u)\,du = \Phi(x-\mu) . \qquad (35) $$

Note that the form $\Phi(x-\mu)$ of the above cdf implies the corresponding pdf to be of the
form (34), i.e. implies $\mu$ to be a location parameter of a sampling probability distribution
for $x$. Indeed:

$$ f(x|\mu I) = \frac{\partial}{\partial x} F(x,\mu) = \frac{\partial}{\partial x} \Phi(x-\mu) = \frac{d}{du}\Phi(u)\,\frac{\partial u}{\partial x} = \frac{d}{du}\Phi(u) = \phi(u) , $$

where

$$ u = x - \mu . $$

Since the integrand in (35), i.e. the pdf for $x$, is a positive function, and the upper bound
of the integral is strictly decreasing with the increase in the parameter, the cdf is evidently
strictly decreasing in $\mu$.

With the location parameter being fixed to $\mu_0$, say to 0, the pdf (30) for $x$ reduces to:

$$ f(x|\sigma\mu_0 I) = \frac{1}{\sigma}\,\varphi\!\left(\frac{x-\mu_0}{\sigma}\right) = \frac{1}{\sigma}\,\varphi\!\left(\frac{x}{\sigma}\right) . \qquad (36) $$

When the range of the random variate is bound to the positive half of the real axis,

$$ t \in (0, \infty) , $$

the corresponding parameter $\tau$ of the sampling distribution,

$$ f(t|\tau\mu_0 I) = \frac{1}{\tau}\,\phi\!\left(\frac{t}{\tau}\right) , \qquad (37) $$

is usually referred to as the scale parameter. Note that in symmetric cases when

$$ f(x - \mu_0|\sigma\mu_0 I) = f(-(x - \mu_0)|\sigma\mu_0 I) , $$

the pdf (36) for $x$ with fixed $\mu_0$ can be reduced to a pdf (37) without any loss of either
generality or information. Namely, the pdf for a transformed variate

$$ t = |x - \mu_0| $$

reads:

$$ f(t|\sigma\mu_0 I) = 2\, f(x - \mu_0|\sigma\mu_0 I)\, \left|\frac{dx}{dt}\right| = \frac{2}{\sigma}\,\varphi\!\left(\frac{t}{\sigma}\right) \equiv \frac{1}{\sigma}\,\phi\!\left(\frac{t}{\sigma}\right) , $$

where the factor of 2 arises due to shrinkage of the sample space. In such cases dispersion
parameters are evidently equivalent to scale parameters.

Any sampling probability distribution determined by a scale parameter, $\tau$, and a fixed
location parameter $\mu_0$, can be further transformed into a probability distribution determined
by a location parameter $\nu$ (see, for example, [17], § 4.4, p. 144 or [18], § 3.2.2,
pp. 22-23):

$$ \nu = \ln\tau , $$

and a fixed dispersion parameter $\lambda_0$, say $\lambda_0 = 1$. Namely, a substitution

$$ z = \ln t $$

yields:

$$ f(z|\nu\lambda_0 I) = f(t|\tau\mu_0 I)\left|\frac{dt}{dz}\right| = e^{z-\nu}\,\phi\big(e^{z-\nu}\big) \equiv \tilde{\phi}(z - \nu) . \qquad (38) $$

As was the case with the cdf (35), the cdf for $t$,

$$ F(t, \tau, \mu_0) = \int_{0}^{t} f(t'|\tau\mu_0 I)\,dt' = \int_{0}^{t/\tau} \phi(u)\,du , \qquad (39) $$

is also monotonically decreasing with increasing value of the parameter.
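As a concrete illustration of (38), consider the exponential distribution, $f(t|\tau I) = \tau^{-1} e^{-t/\tau}$, which is of the form (37). The following sketch (an example chosen here, not taken from the text) evaluates the pdf of $z = \ln t$ through the change of variables and confirms that it depends on $z$ and $\nu = \ln\tau$ only through the difference $z - \nu$. All names are hypothetical.

```python
import math

def f_t(t, tau):
    """Exponential pdf for t with scale parameter tau, a special case of (37)."""
    return math.exp(-t / tau) / tau

def f_z(z, nu):
    """pdf of z = ln t by change of variables: f(t|tau) |dt/dz| with t = e^z, tau = e^nu."""
    t, tau = math.exp(z), math.exp(nu)
    return f_t(t, tau) * t

def phi_tilde(u):
    """Right-hand side of (38) for the exponential example; a function of u = z - nu only."""
    return math.exp(u) * math.exp(-math.exp(u))

# Three (z, nu) pairs sharing the same difference z - nu = 1.3.
pairs = [(0.3, -1.0), (1.3, 0.0), (2.3, 1.0)]
values = [f_z(z, nu) for z, nu in pairs]
```

All three entries of `values` coincide with `phi_tilde(1.3)`, i.e. a simultaneous shift of $z$ and $\nu$ leaves the pdf unchanged, as expected for a location parameter.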

Some of the most important continuous sampling distributions are determined by one
or more parameters of the above mentioned types (see [19], § 4.2, pp. 58-83). In addition,
all distributions determined by either location, scale or dispersion parameters share a very
important property: they all belong to the invariant families of distributions.



5 Invariant distributions 

Let

$$ f(x|\theta I) = \phi(x,\theta) $$

be a pdf for a random variable $x$ from a dense sample space $\mathcal{X}$ that is determined by the
value of a parameter $\theta$ from the parameter space $\Theta$. Let there exist a group $\mathcal{G}$ of
transformations $g_a \in \mathcal{G}$ of the sample space into itself:

$$ g_a : \mathcal{X} \to \mathcal{X} , \qquad g_a : x \to g_a(x) = y , $$

where index $a$ denotes the particular element of the group. Since $\mathcal{G}$ is a group, it is closed
under composition of transformations, i.e. a composition $g_c$ of every pair of transformations
$g_a, g_b \in \mathcal{G}$, $g_c = g_b g_a$, such that

$$ g_c(x) = g_b g_a(x) = g_b(g_a(x)) , $$

is also contained in $\mathcal{G}$. In addition, the group also contains an identity $g_e$ such that

$$ g_e(x) = x , \qquad \forall\, x \in \mathcal{X} , $$

and the inverse transformation $g_a^{-1}$ for any $g_a$ such that

$$ g_a^{-1} g_a = g_a g_a^{-1} = g_e . $$

As a consequence, the transformations $g_a$ are one-to-one, i.e. $g_a(x_1) = g_a(x_2)$ implies
$x_1 = x_2$, and onto $\mathcal{X}$, i.e. for every $x_1 \in \mathcal{X}$ there exists an $x_2 \in \mathcal{X}$ such that $g_a(x_2) = x_1$
(see, for example, [17], § 4.1, p. 143).

Since the transformation $y = g_a(x)$ is one-to-one, the pdf for the transformed variate
according to (16) and (17) reads:

$$ f(y|\theta I') = f(g_a(x)|\theta I') = f(x|\theta I)\left|\frac{dy}{dx}\right|^{-1} = \phi(x,\theta)\,|g_a'(x)|^{-1} = \phi\big(g_a^{-1}(y),\theta\big)\,|g_a'(x)|^{-1} . \qquad (40) $$

In addition to $\mathcal{G}$, let there exist also a set $\bar{\mathcal{G}}$ of transformations $\bar{g}_a$ of the parameter
space into itself,

$$ \bar{g}_a : \Theta \to \Theta , \qquad \bar{g}_a : \theta \to \bar{g}_a(\theta) = \nu . $$

Then:

$$ f(y|\nu I') = f(g_a(x)|\bar{g}_a(\theta) I') = \phi\big(x(y),\theta(\nu)\big)\,|g_a'(x)|^{-1} . \qquad (41) $$

If for all $x \in \mathcal{X}$ and $\theta \in \Theta$, and for every $g_a \in \mathcal{G}$, there exists $\bar{g}_a \in \bar{\mathcal{G}}$ such that

$$ f(g_a(x)|\bar{g}_a(\theta) I') = \phi\big(g_a(x), \bar{g}_a(\theta)\big) , \qquad (42) $$

the family of distributions $f(x|\theta I)$ is said to be invariant under the group $\mathcal{G}$ ([17], § 4.1,
p. 144 and [20], § 23.10, pp. 300-301).

If a family of distributions is invariant under $\mathcal{G}$, then the set $\bar{\mathcal{G}}$ of transformations of $\Theta$
into itself is also a group, usually referred to as the induced group ([20], § 23.10, p. 300).
Namely, according to the definition of invariance, if the pdf for $x$ is given by $\phi(x,\theta)$, the
pdf for $g_a(x)$ is given by $\phi(g_a(x), \bar{g}_a(\theta))$. Hence, the pdf for $g_b(g_a(x)) = g_b g_a(x)$ is given
by both $\phi\big(g_b(g_a(x)), \bar{g}_b(\bar{g}_a(\theta))\big)$ and $\phi\big((g_b g_a)(x), \overline{g_b g_a}(\theta)\big)$. From the equality of the two it
follows that

$$ \bar{g}_b \bar{g}_a = \overline{g_b g_a} . $$

This shows that $\bar{\mathcal{G}}$ is closed under composition. It also shows that $\bar{\mathcal{G}}$ is closed under
inverses if we let $g_b = g_a^{-1}$ and note that $\bar{g}_e$ is the identity in $\bar{\mathcal{G}}$.

For example, a sampling distribution for $\bar{x}$ and $s$ (32), determined by the values of a
location parameter $\mu$ and a dispersion parameter $\sigma$, is invariant under the group of
simultaneous location and scale transformations:

$$ g_{a,b} : \bar{x} \to g_{a,b}(\bar{x}) = a\bar{x} + b , \qquad g_{a,b} : s \to g_{a,b}(s) = as , $$
$$ \bar{g}_{a,b} : \mu \to \bar{g}_{a,b}(\mu) = a\mu + b , \qquad \bar{g}_{a,b} : \sigma \to \bar{g}_{a,b}(\sigma) = a\sigma , \qquad (43) $$

where

$$ a \in (0, \infty) \qquad \text{and} \qquad b \in (-\infty, \infty) . $$

By fixing the dispersion parameter, the symmetry of the pdf with respect to the scale
transformation is broken, leaving only the symmetry with respect to a simultaneous translation
of $x$ and $\mu$ by an arbitrary real number $b$:

$$ g_b : x \to g_b(x) = x + b , \qquad \bar{g}_b : \mu \to \bar{g}_b(\mu) = \mu + b . \qquad (44) $$

When, on the other hand, the location parameter is fixed to $\mu_0 = 0$ and the dispersion (or
scale) of the distribution is unknown, the appropriate pdf $f(x|\sigma\mu_0 I)$ (37) is still invariant
under the scale transformation:

$$ g_a : x \to g_a(x) = ax , \qquad \bar{g}_a : \sigma \to \bar{g}_a(\sigma) = a\sigma . \qquad (45) $$
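The invariance condition (42) can be verified numerically for a simple member of the location-dispersion family. The sketch below assumes a Gaussian pdf (an illustrative choice, not required by the text) and the transformation $x \to ax + b$ with the induced transformation $(\mu, \sigma) \to (a\mu + b, a\sigma)$. It compares the pdf of $y = g_{a,b}(x)$ obtained from the change of variables (40) with the original functional form evaluated at the transformed arguments. Function names and numerical values are hypothetical.

```python
import math

def pdf(x, mu, sigma):
    """Gaussian member of the location-dispersion family (30); an illustrative choice."""
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))

def g(x, a, b):
    """Transformation of the sample space, x -> a x + b."""
    return a * x + b

def g_bar(mu, sigma, a, b):
    """Induced transformation of the parameter space, (mu, sigma) -> (a mu + b, a sigma)."""
    return a * mu + b, a * sigma

def transformed_pdf(y, mu, sigma, a, b):
    """pdf of y = g(x) from the change of variables (40): f(x) |dy/dx|^{-1}."""
    x = (y - b) / a
    return pdf(x, mu, sigma) / abs(a)

a, b = 2.5, -1.0
mu, sigma = 0.4, 1.2
checks = []
for x in (-2.0, 0.0, 0.7, 3.1):
    y = g(x, a, b)
    nu_mu, nu_sigma = g_bar(mu, sigma, a, b)
    # left: change of variables; right: same functional form, transformed arguments
    checks.append((transformed_pdf(y, mu, sigma, a, b), pdf(y, nu_mu, nu_sigma)))
```

Each pair in `checks` agrees to rounding accuracy, which is the content of (42) for this family.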

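Let now $f(x|\theta I)$ be an invariant sampling distribution and let $F(x,\theta)$ be its cdf such
that

$$ F(x,\theta) = \int_{x_a}^{x} f(x'|\theta I)\,dx' = \int_{x_a}^{x} \phi(x',\theta)\,dx' , $$

where $x_a$ is the lower bound of the sample space. Then the cdf for $y = g_a(x)$, given
$\nu = \bar{g}_a(\theta)$, reads:

$$ F(y,\nu) = F\big(g_a(x), \bar{g}_a(\theta)\big) = \int_{y_a}^{y} f(y'|\nu I')\,dy' = \int_{y_a}^{g_a(x)} \phi\big(g_a(x'), \bar{g}_a(\theta)\big)\,d\big(g_a(x')\big) , $$

where $y_a$ is the lower bound of the range of $y$. It is easy to show that the lower and the
upper bound of the range of $x$, $x_a$ and $x_b$, are transformed into the bounds of $y$, $y_a$
and $y_b$:

$$ y_a = \begin{cases} g_a(x_a) \; ; & g_a'(x) > 0 \\ g_a(x_b) \; ; & g_a'(x) < 0 \end{cases} \qquad \text{and} \qquad y_b = \begin{cases} g_a(x_b) \; ; & g_a'(x) > 0 \\ g_a(x_a) \; ; & g_a'(x) < 0 \end{cases} $$

and that the cdf for $y$, given $\nu$, is related to the cdf for $x$, given $\theta$, as:

$$ F(y,\nu) = F\big(g_a(x), \bar{g}_a(\theta)\big) = \begin{cases} \int_{g_a(x_a)}^{g_a(x)} \phi\big(y', \bar{g}_a(\theta)\big)\,dy' = F(x,\theta) \; ; & g_a'(x) > 0 \\ \int_{g_a(x_b)}^{g_a(x)} \phi\big(y', \bar{g}_a(\theta)\big)\,dy' = 1 - F(x,\theta) \; ; & g_a'(x) < 0 \end{cases} $$

Indeed:

$$ F\big(g_a(x), \bar{g}_a(\theta)\big) - F\big(g_a(x_a), \bar{g}_a(\theta)\big) = \int_{g_a(x_a)}^{g_a(x)} f\big(y'|\bar{g}_a(\theta) I'\big)\,dy' = \int_{g_a(x_a)}^{g_a(x)} \phi(x',\theta)\,|g_a'(x')|^{-1}\,d\big(g_a(x')\big) = \pm \int_{x_a}^{x} \phi(x',\theta)\,dx' = \pm F(x,\theta) , $$

where the positive and the negative sign correspond to $g_a'(x) > 0$ and to $g_a'(x) < 0$,
respectively. Setting $x$ to the upper bound $x_b$ of its range, the above equation reads:

$$ F\big(g_a(x_b), \bar{g}_a(\theta)\big) - F\big(g_a(x_a), \bar{g}_a(\theta)\big) = \pm F(x_b,\theta) = \pm 1 . $$

Since the cdf's are limited within $[0, 1]$, this completes the proof by implying

$$ F\big(g_a(x_a), \bar{g}_a(\theta)\big) = \begin{cases} 0 \; ; & g_a'(x) > 0 \\ 1 \; ; & g_a'(x) < 0 \end{cases} \qquad \text{and} \qquad F\big(g_a(x_b), \bar{g}_a(\theta)\big) = \begin{cases} 1 \; ; & g_a'(x) > 0 \\ 0 \; ; & g_a'(x) < 0 \end{cases} $$

for every $g_a \in \mathcal{G}$ and $\bar{g}_a \in \bar{\mathcal{G}}$, and for all $\theta \in \Theta$.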

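A very important corollary, the Existence Theorem, can be deduced from the above
relations. Let a probability distribution for $x$ be invariant under $\mathcal{G}$, and let $\mathcal{G}$ and $\bar{\mathcal{G}}$ be
continuous groups such that the partial derivatives

$$ \frac{\partial}{\partial a}\, g_a(x) \qquad \text{and} \qquad \frac{\partial}{\partial a}\, \bar{g}_a(\theta) $$

exist for every $g_a \in \mathcal{G}$ and $\bar{g}_a \in \bar{\mathcal{G}}$, with both derivatives always being different from zero
and finite. In other words, $\mathcal{G}$ and $\bar{\mathcal{G}}$ are to be Lie groups (see, for example, [21], § 7.1-7.2,
pp. 126-130). In addition, let the range $(x_a, x_b)$ of the sampling variate also be invariant
under $\mathcal{G}$, i.e.

$$ g_a(x_a) = \begin{cases} x_a \; ; & g_a'(x) > 0 \\ x_b \; ; & g_a'(x) < 0 \end{cases} \qquad \text{and} \qquad g_a(x_b) = \begin{cases} x_b \; ; & g_a'(x) > 0 \\ x_a \; ; & g_a'(x) < 0 \end{cases} $$

Then the cdf for $y = g_a(x)$, given $\nu = \bar{g}_a(\theta)$, can be rewritten as:

$$ F\big(g_a(x), \bar{g}_a(\theta)\big) = \begin{cases} F(x,\theta) \; ; & g_a'(x) > 0 \\ 1 - F(x,\theta) \; ; & g_a'(x) < 0 \end{cases} $$

which permits the following conclusions: the cdf $F\big(g_a(x), \bar{g}_a(\theta)\big)$ is independent of the
parameter $a$ of the transformations, i.e.

$$ \frac{\partial}{\partial a}\, F\big(g_a(x), \bar{g}_a(\theta)\big) = 0 , $$

and the parameter $a$ enters $F\big(g_a(x), \bar{g}_a(\theta)\big)$ only through $g_a(x)$ and $\bar{g}_a(\theta)$, i.e.

$$ \frac{d}{da}\, F\big(g_a(x), \bar{g}_a(\theta)\big) = F_1\big(g_a(x), \bar{g}_a(\theta)\big)\,\frac{\partial}{\partial a}\, g_a(x) + F_2\big(g_a(x), \bar{g}_a(\theta)\big)\,\frac{\partial}{\partial a}\, \bar{g}_a(\theta) , $$

where $F_i$ denotes differentiation with respect to the $i$-th argument of $F$ (we adhere to this
notation throughout the present paper, whatever the function and the arguments may be).
Then, by combining the two conclusions and by setting $a = e$ we obtain:

$$ F_1(x,\theta)\, k'(\theta) + F_2(x,\theta)\, h'(x) = 0 , \qquad (46) $$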



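where the derivatives $h'(x)$ and $k'(\theta)$ of functions $h(x)$ and $k(\theta)$ are defined as reciprocals
of the corresponding infinitesimal operators of the Lie groups $\mathcal{G}$ and $\bar{\mathcal{G}}$:

$$ h'(x) = \frac{dh(x)}{dx} \equiv \left[ \frac{\partial}{\partial a}\, g_a(x) \Big|_{a=e} \right]^{-1} \qquad (47) $$

and

$$ k'(\theta) = \frac{dk(\theta)}{d\theta} \equiv \left[ \frac{\partial}{\partial a}\, \bar{g}_a(\theta) \Big|_{a=e} \right]^{-1} . \qquad (48) $$

By defining a function $G(x,\theta)$,

$$ G(x,\theta) \equiv h(x) - k(\theta) , $$

(46) can be further reduced to

$$ F_1(x,\theta)\, G_2(x,\theta) - F_2(x,\theta)\, G_1(x,\theta) = 0 \qquad (49) $$

or to a functional determinant (see [22], § 7.2.1, p. 325),

$$ \begin{vmatrix} F_1(x,\theta) & F_2(x,\theta) \\ G_1(x,\theta) & G_2(x,\theta) \end{vmatrix} = 0 . \qquad (50) $$

With $G_1(x,\theta) = h'(x)$ being different from zero, $F_1(x,\theta)$ and $F_2(x,\theta)$ can be expressed
as

$$ F_1(x,\theta) = \alpha(x,\theta)\, G_1(x,\theta) \qquad (51) $$

and

$$ F_2(x,\theta) = \beta(x,\theta)\, G_2(x,\theta) , \qquad (52) $$

which, inserted in (50), yield:

$$ G_1(x,\theta)\, G_2(x,\theta)\, \big(\alpha(x,\theta) - \beta(x,\theta)\big) = 0 . $$

Since this is to be true for any $\theta$ and $x$, $\alpha(x,\theta)$ and $\beta(x,\theta)$ must be the same functions:

$$ \alpha(x,\theta) = \beta(x,\theta) . $$

Taking this into account, we multiply equations (51) and (52) by $dx$ and $d\theta$, respectively,
so that their sum reads:

$$ dF(x,\theta) = \alpha(x,\theta)\, dG(x,\theta) , $$

implying the distribution function $F(x,\theta)$ to be a function of a single variable $G(x,\theta)$:

$$ F(x,\theta) = \Phi\big(G(x,\theta)\big) = \Phi\big(h(x) - k(\theta)\big) . \qquad (53) $$

By choosing

$$ z \equiv h(x) \qquad \text{and} \qquad \mu \equiv k(\theta) , $$

the cdf $F(x,\theta)$ of an invariant sampling distribution thus reduces to

$$ F(x,\theta) = \Phi(z - \mu) , $$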

which is a cdf of the variate $z$ and a location parameter $\mu$ (c.f. eq. (35)). The above
reasoning can be summarized as

Theorem 1 A sampling distribution for a continuous variate $x$, given a continuous parameter
$\theta$, with both its form and its domain being invariant under a Lie group $\mathcal{G}$, is
necessarily reducible (by separate transformations $x \to z$ and $\theta \to \mu$) to a sampling
distribution for $z$ with the parameter $\mu$ being a location parameter.

For example, a pdf $f(t|\tau\mu_0 I)$ for $t \in (0, \infty)$, with $\tau$ being a scale parameter and with the
location parameter $\mu_0$ being fixed to zero, is invariant under the group of scale transformations
(45). Then, according to (47) and (48), $h'(t)$ and $k'(\tau)$ read:

$$ h'(t) = \frac{1}{t} \qquad \text{and} \qquad k'(\tau) = \frac{1}{\tau} , $$

which, in order to reduce the example to the problem of inference about a location parameter
$\mu$, implies the appropriate transformations of the variate $t$ and the parameter $\tau$:

$$ z = h(t) = \ln t \qquad \text{and} \qquad \mu = k(\tau) = \ln\tau . $$

Indeed, this is in perfect agreement with equation (38) of Section 4.
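The construction of $h$ from the group action, eq. (47), can also be carried out numerically. The sketch below is only an illustration under stated assumptions: it takes the scale group $g_a(t) = at$ with identity $a = 1$, approximates the derivative with respect to $a$ by a central difference, and integrates $h'$ with a midpoint rule from a reference point $t = 1$ (where $h$ is set to zero, an arbitrary choice of the integration constant). Function names, step sizes and the reference point are hypothetical.

```python
import math

def scale(a, t):
    """Scale transformation g_a(t) = a t; the identity element is a = 1."""
    return a * t

def h_prime(group, x, identity=1.0, eps=1e-6):
    """Reciprocal of d g_a(x)/da at the identity, cf. (47), by central difference."""
    dg_da = (group(identity + eps, x) - group(identity - eps, x)) / (2.0 * eps)
    return 1.0 / dg_da

def h(group, x, x_ref=1.0, n=20000):
    """h(x) - h(x_ref), obtained by midpoint-rule integration of h'."""
    step = (x - x_ref) / n
    return sum(h_prime(group, x_ref + (i + 0.5) * step) for i in range(n)) * step

z = h(scale, 5.0)   # should reproduce ln 5 for the scale group
```

For the scale group the numerically integrated `z` reproduces $\ln t$, in line with the transformation $z = \ln t$ found above; other one-parameter groups can be passed in place of `scale` as long as the derivative at the identity does not vanish on the integration path.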

In the following two sections we will see that the invariance of sampling distribu- 
tions is indispensable when constructing a logically consistent theory of inference about 
parameters. 



6 Consistency Theorem 

Suppose that before we received the first evidence, $x_1$, we had been completely ignorant
about the value of the parameter $\theta$ that determines the probability distribution for $x$. We
had only known the type of sampling distribution for $x$ and the permissible range $(\theta_a, \theta_b)$
of the parameter. Let the probability for $x$ taking the value $x_1$ in a discrete case, or taking
a value in the interval $(x_1, x_1 + dx)$ in a continuous case, be denoted by $\Phi(x_1,\theta)$:

$$ p(x_1|\theta I) = \Phi(x_1,\theta) . \qquad (54) $$

In this section we prove the following proposition, henceforth referred to as the Consistency
Theorem:

Theorem 2 In order to meet the consistency Desideratum, the pdf for $\theta$ based on $x_1$ only
must be directly proportional to the likelihood (54).

Proof. After having made the first observation, we know the type of sampling distribution
and the value of $x_1$. Therefore, the pdf for $\theta$ given evidence $x_1$ will be proportional to a
function $\tilde{\Phi}(x_1,\theta)$,

$$ f(\theta|x_1 I) = \frac{\tilde{\Phi}(x_1,\theta)}{\zeta(x_1)} , \qquad (55) $$

whose form we would like to determine. The denominator $\zeta(x_1)$ is just a normalization
constant due to (21),

$$ \zeta(x_1) = \int_{\theta_a}^{\theta_b} \tilde{\Phi}(x_1,\theta')\,d\theta' , \qquad (56) $$

and contains no information about $\theta$.

Let now $x_2$ be another piece of evidence, independent of $x_1$, that we would like to
include in our inference about $\theta$. Since $x_2$ is independent of $x_1$, and subject to the same
probability distribution (54) as $x_1$, its likelihood reads:

$$ p(x_2|\theta x_1 I) = p(x_2|\theta I) = \Phi(x_2,\theta) . $$

In the first section we saw that the only consistent way of updating the pdf for $\theta$ is the one in
accordance with Bayes' Theorem (29). With $f(\theta|x_1 I)$ taking the role of the prior pdf for
$\theta$, the pdf posterior to including $x_2$ into our reasoning about $\theta$ is written as:

$$ f(\theta|x_1 x_2 I) = \frac{\tilde{\Phi}(x_1,\theta)\,\Phi(x_2,\theta)}{\zeta(x_1, x_2)} , \qquad (57) $$

with the normalization constant $\zeta(x_1, x_2)$ being

$$ \zeta(x_1, x_2) = \int_{\theta_a}^{\theta_b} \tilde{\Phi}(x_1,\theta')\,\Phi(x_2,\theta')\,d\theta' . $$

Nothing prevents us from reversing the order of taking the two pieces of information,
$x_1$ and $x_2$, into account, which results in the following pdf for $\theta$:

$$ f(\theta|x_2 x_1 I) = \frac{\tilde{\Phi}(x_2,\theta)\,\Phi(x_1,\theta)}{\zeta(x_2, x_1)} , \qquad (58) $$

with the appropriate normalization constant $\zeta(x_2, x_1)$,

$$ \zeta(x_2, x_1) = \int_{\theta_a}^{\theta_b} \tilde{\Phi}(x_2,\theta')\,\Phi(x_1,\theta')\,d\theta' . $$

Moreover, the consistency Desideratum III.a requires equality of the two results, (57) and
(58):

$$ f(\theta|x_1 x_2 I) = f(\theta|x_2 x_1 I) , $$

or

$$ \frac{\tilde{\Phi}(x_1,\theta)\,\Phi(x_2,\theta)}{\zeta(x_1, x_2)} = \frac{\tilde{\Phi}(x_2,\theta)\,\Phi(x_1,\theta)}{\zeta(x_2, x_1)} . \qquad (59) $$

The ratio of eq. (59) and its derivative with respect to $\theta$ yields

$$ \frac{\tilde{\Phi}'(x_1,\theta)}{\tilde{\Phi}(x_1,\theta)} - \frac{\Phi'(x_1,\theta)}{\Phi(x_1,\theta)} = \frac{\tilde{\Phi}'(x_2,\theta)}{\tilde{\Phi}(x_2,\theta)} - \frac{\Phi'(x_2,\theta)}{\Phi(x_2,\theta)} , \qquad (60) $$

where we use the notation

$$ \Phi'(x,\theta) \equiv \frac{\partial}{\partial\theta}\,\Phi(x,\theta) . $$

Evidently, in order to ensure equality in (60) for all possible values of $x_1$ and $x_2$, the
left and the right side of the equation must be independent of $x_1$ and $x_2$, respectively, but
can depend on the value of the parameter $\theta$.

Note that at this point, the two sides of equation (60) can, in principle, also depend on the
values $x_a$, $x_b$, $\theta_a$ and $\theta_b$, determining the admissible ranges of the sampling variate and
of the parameter. However, in the following sections we will see that in all problems of
parameter inference that can be consistently solved, there is no such explicit dependence.

Taking this dependence into account by introducing a function $h(\theta)$ we obtain

$$ \frac{\partial}{\partial\theta} \ln \tilde{\Phi}(x,\theta) = \frac{\partial}{\partial\theta} \ln \Phi(x,\theta) + h(\theta) , $$

and, after integration of the latter,

$$ \tilde{\Phi}(x,\theta) = k\,\pi(\theta)\,\Phi(x,\theta) , \qquad (61) $$

where $\pi(\theta)$ is a consistency factor,

$$ \pi(\theta) \equiv \exp\left\{ \int^{\theta} h(\theta')\,d\theta' \right\} , \qquad (62) $$

and $k$ an arbitrary integration constant. That is, the consistency factor is determined only
up to an arbitrary constant factor.

The consistency factor is differentiable by construction. From the form of $\pi(\theta)$ (62)
it is also obvious that if $k$ is chosen to be positive, $k\pi(\theta)$ is positive for every $\theta$ defined.
Consequently, the normalization factor $\zeta(x)$,

$$ \zeta(x) = k \int \pi(\theta')\,\Phi(x,\theta')\,d\theta' , $$

being an integral of a product of positive factors (61), is also a strictly positive quantity
for every $x$ defined.

By inserting the solution (61) into (55), we can finally write:

$$ f(\theta|xI) = \frac{k\,\pi(\theta)}{\zeta(x)}\,\Phi(x,\theta) = \frac{k\,\pi(\theta)}{\zeta(x)}\, p(x|\theta I) , \qquad (63) $$

which completes the proof of the Consistency Theorem.
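The content of the Consistency Theorem can be illustrated numerically. The sketch below is an illustration only: it assumes a Gaussian sampling model with unit dispersion, a parameter grid, and two candidate rules for the first-step function $\tilde{\Phi}$. One rule is of the form (61), with an arbitrarily chosen (hypothetical) factor $\pi(\theta)$; the other, the squared likelihood, is not. For each rule the posterior is computed with the two data taken in both orders, as in (57) and (58). All names and numerical values are assumptions made for the example.

```python
import math

def likelihood(x, theta):
    """Gaussian likelihood with unit dispersion; an illustrative sampling model."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)

def two_step_posterior(first, second, first_step, grid, step):
    """Assign f(theta|first) with `first_step`, then update with `second` via Bayes' Theorem."""
    unnorm = [first_step(first, th) * likelihood(second, th) for th in grid]
    total = sum(unnorm) * step
    return [u / total for u in unnorm]

def consistent(x, theta):
    """Form (61): a function of theta alone times the likelihood."""
    pi = math.exp(0.1 * theta)          # hypothetical consistency factor
    return pi * likelihood(x, theta)

def inconsistent(x, theta):
    """Not of the form (61): the likelihood squared."""
    return likelihood(x, theta) ** 2

step = 0.01
grid = [-10.0 + (i + 0.5) * step for i in range(2000)]   # theta in (-10, 10)
x1, x2 = 0.4, 1.9                                        # hypothetical data

def order_gap(first_step):
    """Largest pointwise difference between the (x1, x2) and (x2, x1) posteriors."""
    a = two_step_posterior(x1, x2, first_step, grid, step)
    b = two_step_posterior(x2, x1, first_step, grid, step)
    return max(abs(p - q) for p, q in zip(a, b))

gap_consistent = order_gap(consistent)
gap_inconsistent = order_gap(inconsistent)
```

For the rule of the form (61) the two orders agree to rounding accuracy, whereas for the squared likelihood the result depends on which datum is taken first, in conflict with Desideratum III.a.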

The result is valid for $p(x|\theta I)$ being the likelihood of either a discrete or a dense variable $x$.
For the latter, the Consistency Theorem can be rewritten by replacing the likelihood with
the appropriate likelihood density, $f(x|\theta I)$,

$$ f(\theta|xI) = \frac{\pi(\theta)\, f(x|\theta I)}{\int \pi(\theta')\, f(x|\theta' I)\,d\theta'} = \frac{\pi(\theta)}{\eta(x)}\, f(x|\theta I) , \qquad (64) $$

where $\eta(x)$ is the corresponding normalization factor,

$$ \eta(x) = \int \pi(\theta')\, f(x|\theta' I)\,d\theta' . \qquad (65) $$

Evidently, the normalization factor is also determined up to a constant factor $k$, i.e., the
consistency factor $\pi(\theta)$ and the normalization factor $\eta(x)$ are determined up to the same
factor.

Similarly, in terms of posterior probabilities and likelihoods instead of the corresponding
densities, the Theorem reads:

$$ p(\theta|xI) = \frac{\pi(\theta)\, p(x|\theta I)\,d\theta}{\int \pi(\theta')\, p(x|\theta' I)\,d\theta'} . \qquad (66) $$

The form of the Consistency Theorem (66) closely resembles that of Bayes' Theorem
(25). In both cases, within a specified model, the complete information about the
inferred parameter $\theta$ of the model that can be extracted from a measurement $x$ is contained
in the value of the appropriate likelihood, $p(x|\theta I)$. But there is also a fundamental
and very important difference between the two Theorems: while $f(\theta|I)$ in Bayes' Theorem
represents the pdf for $\theta$ prior to including evidence $x$ in our inference about $\theta$, the
consistency factor $\pi(\theta)$ in the Consistency Theorem is just a proportionality coefficient
between the pdf for $\theta$ and the likelihood function. The form of the factor depends on the
only relevant information $I$ that we possess before the first datum $x_1$ is collected, i.e. it
depends on the specified sampling model.

In Section 15 we comment on how overlooking this difference led to a long-lasting
confusion in plausible reasoning. Before that we show under what conditions the factors
$\pi(\theta)$ can be consistently determined, and uniquely determine them for such cases by
following the basic Desiderata.



7 Objective inference and equivalence of information 

According to the definition of probability adopted in the first section, every assigned probability
is necessarily subjective: no probability distribution can be assigned independently
of the experience of the person who is expressing his or her degree of belief. In the words
of Bruno de Finetti ([23], Preface, p. x): "Probability does not exist", meaning that there
is no probability per se. For example, when there is not enough relevant information at
our disposal, we are simply not in a position to make any probabilistic inferences. That
is, even when lacking, information should never be confused with our hopes, fears, value
judgments, etc., since no matter how carefully these are specified, they still represent
mere personal biases, prejudices and speculations.

The general Desiderata represent the rules that we have to obey in order to preserve
consistency of inference, so it is evident that the adjective subjective does not stand for
arbitrary. In fact, in accordance with Desideratum III.c, our goal is that inferences are to
be completely objective in the sense that if in two problems the state of knowledge of a
person making the inference is the same, then he or she must assign the same probabilities
in both. The goal of the present and the following sections is to show when and how
information $I$ about the specified model and its domain, within the framework of our basic
Desiderata (i.e. without any additional ad hoc assumptions), uniquely determines the form
of consistency factors, the latter being indispensable at the starting point of any parameter
inference. Within the Desiderata, only III.a and III.c directly consider equalities of
probabilities and can thus provide equations that could determine the form of consistency
factors. Since the requirements of Desideratum III.a were extensively exploited already
throughout the previous section when the Consistency Theorem was derived, it is only
III.c that is left at our disposal to obtain the desired functional equation for $\pi(\theta)$.
Let

$$ f(x|\theta I) = \phi(x,\theta) $$

be a sampling pdf for $x$ whose parameter $\theta$ we would like to infer. We saw in the preceding
section that in the case when this can be done in a consistent way, the pdf for $\theta$ must take
the form:

$$ f(\theta|xI) = \frac{\pi(\theta)}{\eta(x)}\, f(x|\theta I) = \frac{\pi(\theta)}{\eta(x)}\, \phi(x,\theta) , \qquad (67) $$

where $\eta(x)$ is the usual normalization factor

$$ \eta(x) = \int \pi(\theta')\, f(x|\theta' I)\,d\theta' = \int \pi(\theta')\, \phi(x,\theta')\,d\theta' . \qquad (68) $$

Equation (67) with the unknown pdf $f(\theta|xI)$ clearly does not determine uniquely the form
of the consistency factors: as long as the normalization integral (68) exists, $\pi(\theta)$ can be
any positive and differentiable function of $\theta$. Additional constraints (functional equations)
are therefore needed to reduce all these functions to a single consistent function, i.e. to
the only function that is consistent with our Desiderata.

In the case there exists a group $\mathcal{G}$ of transformations $g_a$ of the sample space such that
$y = g_a(x)$, the above pdf for $\theta$ can be expressed as



where

$$ f(y|\theta I') = \phi(x,\theta)\, |g_a'(x)|^{-1} . $$

Let there also exist a group $\bar{\mathcal{G}}$ of transformations $\bar{g}_a$ of the parameter space,
$\nu = \bar{g}_a(\theta)$. We saw already in Section 2 that, according to the objectivity Desideratum
III.c, the assigned probabilities must be invariant under variate transformations. This is
assured if the pdf's of the original and the transformed variate, $\theta$ and $\nu$, are related via
(16) and (17), so that the pdf for $\nu$, given a measured $x$, reads:

$$ f(\nu|xI') = f(\theta|xI)\, |\bar{g}_a'(\theta)|^{-1} . $$

That is, in order to assign equal probabilities in states of equal knowledge, the pdf for
$\bar{g}_a(\theta)$ must take the form:

$$ f(\bar{g}_a(\theta)|xI') = \frac{\tilde{\pi}(\bar{g}_a(\theta))}{\tilde{\eta}(g_a(x))}\, \tilde{\phi}(g_a(x),\bar{g}_a(\theta)) , \tag{69} $$

where:

$$ f(y|\nu I') = \tilde{\phi}(g_a(x),\bar{g}_a(\theta)) = \phi(x,\theta)\, |g_a'(x)|^{-1} , $$

$$ \tilde{\pi}(\bar{g}_a(\theta)) = k(a)\, \frac{\pi(\theta)}{|\bar{g}_a'(\theta)|} \tag{70} $$

and, for $\bar{g}_a'(\theta) > 0$,

$$ \tilde{\eta}(g_a(x)) = k(a)\, \frac{\eta(x)}{|g_a'(x)|} = \int_{\bar{g}_a(\theta_a)}^{\bar{g}_a(\theta_b)} \tilde{\pi}(\bar{g}_a(\theta'))\, f(g_a(x)|\bar{g}_a(\theta') I')\, d(\bar{g}_a(\theta')) , \tag{71} $$

while for $\bar{g}_a'(\theta) < 0$ the limits of the above integral are to be interchanged.* Note that
in general the value of the multiplication constant $k$, up to which the consistency and the
normalization factors, (62) and (65), are to be uniquely determined (recall the preceding
section), may depend on the value of the transformation parameter $a$.

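* Index $a$ in $g_a$ and $\bar{g}_a$ denotes particular elements of transformation groups, while in $x_a$
and $\theta_a$ it indicates the lower bounds of the sample and parameter space, respectively.

Equation (69) represents a constraint on consistency factors that is additional to (67),
but it also introduces an additional unknown, the function $\tilde{\pi}(\bar{g}_a(\theta))$. For invariant
sampling distributions, however, it is easy to demonstrate that the form of the consistency
factor must also be invariant under the induced group $\bar{\mathcal{G}}$, i.e. that $\tilde{\pi}$ and $\pi$ must be the
same functions. Namely, the forms of consistency factors $\pi(\theta)$ and $\tilde{\pi}(\bar{g}_a(\theta))$ depend on
information $I$ and $I'$ that we possess about $\theta$ and $\bar{g}_a(\theta)$, respectively, prior to collecting
datum $x$: on the forms and domains of sampling distributions, $\phi(x,\theta)$ and
$\tilde{\phi}(g_a(x),\bar{g}_a(\theta))$. In the case where all of these are invariant under particular
transformations $g_a$ and $\bar{g}_a$:

a) $\tilde{\phi}(g_a(x),\bar{g}_a(\theta)) = \phi(g_a(x),\bar{g}_a(\theta))$,

b) $g_a(x_a) = x_a$ and $g_a(x_b) = x_b$ for $g_a'(x) > 0$, or $g_a(x_a) = x_b$ and $g_a(x_b) = x_a$ for
   $g_a'(x) < 0$, and

c) $\bar{g}_a(\theta_a) = \theta_a$ and $\bar{g}_a(\theta_b) = \theta_b$ for $\bar{g}_a'(\theta) > 0$, or $\bar{g}_a(\theta_a) = \theta_b$ and
   $\bar{g}_a(\theta_b) = \theta_a$ for $\bar{g}_a'(\theta) < 0$,

the information $I$ equals information $I'$ and the two consistency factors, $\pi$ and $\tilde{\pi}$,
must be the same functions:

$$ \tilde{\pi}(\bar{g}_a(\theta)) = \pi(\bar{g}_a(\theta)) , \tag{72} $$

so that (70) becomes

$$ \pi(\bar{g}_a(\theta)) = k(a)\, \frac{\pi(\theta)}{|\bar{g}_a'(\theta)|} . \tag{73} $$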

The above functional equation for $\pi(\theta)$ is the cornerstone of the entire theory of consistent
assignment of probabilities to parameters of sampling distributions:

Conjecture 1 Equation (73) is the only functional equation within the basic Desiderata
that can be used for determination of consistency factors.

The proof of this conjecture represents an open problem that is still to be solved. It is a
serious problem, though, since any additional functional equations for $\pi(\theta)$, independent
of (73), with solutions different from solutions of (73), would most seriously jeopardize
the consistency of the entire probabilistic approach to inferences about the parameters of
sampling distributions.

Be that as it may, equation (73) is the only functional equation that we know of that
can be used for determination of consistency factors if we want to rely exclusively on our
basic Desiderata. All other procedures for determination of $\pi(\theta)$ (of non-informative prior
'probability' distributions; see Section 13) that we are aware of involve applications of
some additional ad hoc assumptions, so that there is absolutely no guarantee that reasonings
of such a kind are consistent. Further arguments and examples, supporting the above
conjecture by exhibiting the decisive role of the invariance of sampling distributions under
group transformations in the process of determination of the consistency factors, are
presented in Sections 11, 12, 14 and 15 and in Appendix B.

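In case of a two-parametric induced group $\bar{\mathcal{G}}$ of parameter transformations $\bar{g}_{a,b}$, the
functional equation for the consistency factor $\pi(\theta^{(1)},\theta^{(2)})$ for the inferred parameters
$\theta^{(1)}$ and $\theta^{(2)}$ reads:

$$ \pi\big(\bar{g}_{a,b}(\theta^{(1)},\theta^{(2)})\big) = k(a,b)\, \frac{\pi(\theta^{(1)},\theta^{(2)})}{|J|} , \tag{74} $$

where $J$ stands for the appropriate Jacobian

$$ J = \frac{\partial\, \bar{g}_{a,b}(\theta^{(1)},\theta^{(2)})}{\partial(\theta^{(1)},\theta^{(2)})} . $$

We will come across functional equation (74) in Section 10 during a simultaneous inference
about a location and a scale parameter.

When equation (72) holds, the normalization factor $\tilde{\eta}(g_a(x))$ equals $\eta(g_a(x))$,

$$ \tilde{\eta}(g_a(x)) = \int \pi(\bar{g}_a(\theta'))\, \phi(g_a(x),\bar{g}_a(\theta'))\, d(\bar{g}_a(\theta')) = \eta(g_a(x)) , $$

so that, by (71),

$$ k(a)\, \frac{\eta(x)}{|g_a'(x)|} = \eta(g_a(x)) . \tag{75} $$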

Under what circumstances does a unique solution of the functional equation (73) exist?
Let us consider the problem with a sampling distribution being invariant under a
discrete group of transformations

$$ g_a : x \to g_a(x) = ax , \qquad \bar{g}_a : \theta \to \bar{g}_a(\theta) = a\theta , \tag{76} $$

where $a$ can only take two values,

$$ a \in \{1,-1\} $$

for both groups, $\mathcal{G}$ and $\bar{\mathcal{G}}$. That is, the considered distribution possesses parity under
simultaneous inversion of sampling and parameter space coordinates. Then, for $a = -1$,
functional equation (73) reads:

$$ \pi(-\theta) = k(a=-1)\, \pi(\theta) , $$

or, after an inversion $\theta \leftrightarrow -\theta$,

$$ \pi(\theta) = k(a=-1)\, \pi(-\theta) . $$

Multiplying the two equations yields:

$$ k^2(a=-1) = 1 , $$

which, when the convention about $\pi$ being positive is invoked, further implies

$$ \pi(-\theta) = \pi(\theta) . \tag{77} $$

That is, the consistency factor that corresponds to a sampling distribution being invariant
under simultaneous inversions of sampling and parameter space coordinates, must itself
possess parity under inversion of parameter space coordinates. But apart from this it can
take any form and so in this case the solution of (73) is clearly not unique. It is not
difficult to understand that this is a common feature of all solutions based on invariance
of the sampling distributions under discrete groups. If the symmetry group is discrete,
the sample and the parameter spaces break up into intervals, with no group transformation
connecting different points within the same interval. We are then free to choose
the form of $\pi(\theta)$ in one of these intervals (e.g. we can choose $\pi(\theta)$ for the positive values
of $\theta$ in the above example), so it is evident that it is impossible to determine uniquely the
form of consistency factors for problems that are invariant only under discrete groups of
transformations.
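
The non-uniqueness under a discrete group can be illustrated numerically. The following is a
minimal sketch, not part of the derivation: the two candidate functions `pi_flat` and `pi_bump`
are hypothetical choices made only for illustration. Both are positive and satisfy the parity
condition (77), yet they are not proportional to each other, so (77) alone cannot single out one
consistency factor.

```python
import math

def is_even(pi, thetas, tol=1e-12):
    """True if pi(-theta) equals pi(theta), eq. (77), at every test point."""
    return all(abs(pi(-t) - pi(t)) <= tol for t in thetas)

def pi_flat(theta):
    # Hypothetical candidate 1: a uniform factor.
    return 1.0

def pi_bump(theta):
    # Hypothetical candidate 2: positive, differentiable, even, but not uniform.
    return 1.0 + math.exp(-theta * theta)

thetas = [0.1 * i for i in range(1, 50)]
both_even = is_even(pi_flat, thetas) and is_even(pi_bump, thetas)
# pi_flat is constant, so the candidates are proportional only if pi_bump is constant too.
not_proportional = abs(pi_bump(0.5) - pi_bump(2.0)) > 1e-3
```

Any other even, positive function would pass the same check, which is the freedom described in
the paragraph above.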

It turns out, however, that for sampling distributions that are invariant under Lie
groups, functional equation (73) uniquely determines the form of the corresponding
consistency factors. But according to the Existence Theorem of Section 5, the invariance of
a sampling distribution under a Lie group is found only when the parameter of the
distribution is reducible to a location parameter by one-to-one transformations of both the
parameter and the sampling variate. It is therefore sufficient to determine the form of
consistency factors for location parameters, which is accomplished in the following three
sections.



8 Location parameters 

We saw in Section 5 that a sampling distribution, parameterized by a location parameter
$\mu$ and by a fixed dispersion parameter $\sigma_0$, is invariant under the group of translations (44).
In such a case the functional equation (73) for $\pi(\mu)$ reads:

$$ \pi(\mu + b) = k(b)\, \pi(\mu) . \tag{78} $$

After its differentiation with respect to $b$,

$$ \pi'(\mu + b) = k'(b)\, \pi(\mu) , $$

we set $b = 0$ and obtain a simple differential equation with separable variables $\pi$ and $\mu$,

$$ \frac{d\pi}{\pi} = -q\, d\mu , \tag{79} $$

with the constant $q$ being defined as

$$ q \equiv -k'(0) . $$

The general solution of (79) reads

$$ \pi(\mu) = C_\pi \exp\{-q\mu\} , $$

where $C_\pi$ is an integration constant. Since all multiplication constants can be put into
$\eta$, we can assume without any loss of generality that $C_\pi = 1$, obtaining in this way the
general form of the consistency factor for location parameters:

$$ \pi(\mu) = \exp\{-q\mu\} . \tag{80} $$
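
As a quick numerical illustration of (78) and (80), the sketch below checks that for
$\pi(\mu) = \exp\{-q\mu\}$ the ratio $\pi(\mu+b)/\pi(\mu)$ depends on $b$ only, while for a
function of another form it varies with $\mu$. The value of `q`, the shift `b` and the
comparison function `pi_other` are assumptions chosen purely for illustration.

```python
import math

def shift_ratio(pi, mu, b):
    """Ratio pi(mu + b) / pi(mu); eq. (78) requires it to depend on b only."""
    return pi(mu + b) / pi(mu)

def ratio_spread(pi, b, mus):
    """Largest minus smallest shift ratio over a set of mu values."""
    ratios = [shift_ratio(pi, mu, b) for mu in mus]
    return max(ratios) - min(ratios)

q = 0.7   # hypothetical value of the constant q
b = 1.5   # hypothetical translation

def pi_exp(mu):
    # The form (80).
    return math.exp(-q * mu)

def pi_other(mu):
    # A hypothetical positive function that is not of the form (80).
    return 1.0 / (1.0 + mu * mu)

mus = [-2.0 + 0.25 * i for i in range(17)]
spread_exp = ratio_spread(pi_exp, b, mus)      # numerically zero: k(b) = exp(-q b)
spread_other = ratio_spread(pi_other, b, mus)  # clearly non-zero
```

For the exponential form the common ratio is $k(b) = \exp\{-qb\}$, in agreement with
$q = -k'(0)$.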

For sampling distributions, symmetric under simultaneous inversions of the sampling and
the parameter space, equation (77) implies $q = 0$, i.e. implies uniform consistency factors
for location parameters. By invoking the symmetry of the problems of simultaneous
inference about a location and a scale parameter in Section 10, we show that further
applications of the basic Desiderata and their direct implications also require $q = 0$ in the
case of problems without space-inversion symmetry.

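Based on a measured value $x_1$, the pdf for a location parameter $\mu$ therefore reads:

$$ f(\mu|x_1\sigma_0 I) = \frac{e^{-q\mu}}{\eta(x_1)}\, f(x_1|\mu\sigma_0 I) = \frac{e^{-q\mu}}{\eta(x_1)}\, \frac{1}{\sigma_0}\, \phi\Big(\frac{x_1-\mu}{\sigma_0}\Big) . \tag{81} $$

Now, as an example, we want to update our inference about the parameter $\mu$ by including
additional information $x_2$ in our inference, where $x_2$ is a result of a measurement of $x$ that
is also subject to the same sampling distribution and independent of $x_1$. We can write the
likelihood density for $x_2$,

$$ f(x_2|\mu\sigma_0 x_1 I) = f(x_2|\mu\sigma_0 I) = \frac{1}{\sigma_0}\, \phi\Big(\frac{x_2-\mu}{\sigma_0}\Big) , \tag{82} $$

and the updated pdf for $\mu$,

$$ \begin{aligned}
f(\mu|x_1 x_2\sigma_0 I) &= \frac{\eta(x_1)}{\eta(x_1,x_2)}\, f(\mu|x_1\sigma_0 I)\, f(x_2|\mu\sigma_0 I)
 = \frac{e^{-q\mu}}{\eta(x_1,x_2)}\, f(x_1|\mu\sigma_0 I)\, f(x_2|\mu\sigma_0 I) \\
&= \frac{e^{-q\mu}}{\eta(x_1,x_2)}\, \frac{1}{\sigma_0^2}\, \phi\Big(\frac{x_1-\mu}{\sigma_0}\Big)\, \phi\Big(\frac{x_2-\mu}{\sigma_0}\Big)
 = \frac{e^{-q\mu}}{\eta(x_1,x_2)}\, f(x_1 x_2|\mu\sigma_0 I) ,
\end{aligned} \tag{83} $$

with the appropriate normalization constant, $\eta(x_1,x_2)$, being:

$$ \eta(x_1,x_2) = \int_{-\infty}^{\infty} e^{-q\mu'}\, f(x_1 x_2|\mu'\sigma_0 I)\, d\mu' . \tag{84} $$

The update (83) is made in accordance with Bayes' theorem (29) with the purpose of
ensuring that our reasoning is consistent.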
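
The consistency of the update (83) can be illustrated on a grid: updating first with $x_1$ and
then with $x_2$ gives the same pdf for $\mu$ as using both data at once. The sketch below is
only an illustration under stated assumptions: a Gaussian shape for $\phi$, the values of
`sigma0`, `q`, `x1`, `x2`, and the grid itself are all hypothetical choices, not taken from the
text.

```python
import math

def gauss(x, mu, sigma0):
    """Assumed sampling density: (1/sigma0) * phi((x - mu)/sigma0) with Gaussian phi."""
    return math.exp(-0.5 * ((x - mu) / sigma0) ** 2) / (math.sqrt(2.0 * math.pi) * sigma0)

def normalize(weights, step):
    """Divide by the Riemann-sum integral so the grid pdf integrates to one."""
    total = sum(weights) * step
    return [w / total for w in weights]

sigma0, q = 1.0, 0.0      # hypothetical dispersion and constant q
x1, x2 = 0.4, 1.6         # hypothetical measurements
step = 0.01
grid = [-10.0 + step * i for i in range(2001)]

# Eq. (81): pdf for mu after the first datum.
post1 = normalize([math.exp(-q * m) * gauss(x1, m, sigma0) for m in grid], step)
# First line of eq. (83): multiply by the likelihood of x2 and renormalize.
sequential = normalize([p * gauss(x2, m, sigma0) for p, m in zip(post1, grid)], step)
# Last line of eq. (83): both data at once.
batch = normalize(
    [math.exp(-q * m) * gauss(x1, m, sigma0) * gauss(x2, m, sigma0) for m in grid], step
)

max_diff = max(abs(s - b) for s, b in zip(sequential, batch))
mean_mu = sum(m * p for m, p in zip(grid, batch)) * step
```

With $q = 0$ and a Gaussian $\phi$, the resulting pdf is centred on $(x_1 + x_2)/2$, as expected.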

The product of the likelihood densities $f(x_1|\mu\sigma_0 I)$ and $f(x_2|\mu\sigma_0 I)$ in (83) is equal
to the combined likelihood density for the two independent events, $x_1$ and $x_2$, due to the
product rule. According to (32), the likelihood density $f(x_1 x_2|\mu\sigma_0 I)$ can equivalently
be represented by the density for $\bar{x}$ and $s$, $f(\bar{x}s|\mu\sigma_0 I)$, where

$$ \bar{x} \equiv \frac{x_1 + x_2}{2} , \qquad s \equiv \frac{|x_1 - x_2|}{2} . \tag{85} $$

Written as $f(\mu|\sigma_0\bar{x}sI)$, the pdf for $\mu$ (83) thus reads:

$$ f(\mu|\sigma_0\bar{x}sI) = \frac{e^{-q\mu}}{\eta(\bar{x},s)}\, f(\bar{x}s|\mu\sigma_0 I) = \frac{e^{-q\mu}}{\eta(\bar{x},s)}\, \frac{4}{\sigma_0^2}\, \phi\Big(\frac{\bar{x}-\mu}{\sigma_0} + \frac{s}{\sigma_0}\Big)\, \phi\Big(\frac{\bar{x}-\mu}{\sigma_0} - \frac{s}{\sigma_0}\Big) , \tag{86} $$

with the appropriate normalization constant $\eta(\bar{x},s)$,

$$ \eta(\bar{x},s) = \int_{-\infty}^{\infty} e^{-q\mu'}\, f(\bar{x}s|\mu'\sigma_0 I)\, d\mu' . \tag{87} $$

The findings of the present example will become of particular importance in Section 10,
where we determine the form of the consistency factor $\pi(\mu,\sigma)$ for simultaneous
estimation of a location and a dispersion parameter.

9 Inference about scale and dispersion parameters 

When, contrary to the preceding section, the inferred dispersion (or scale) parameter is
unknown and the location parameter is fixed to $\mu = \mu_0$, the problem is invariant under the
group (45) of scale transformations. In such a case the functional equation (73) for $\pi(\sigma)$
reads:

$$ \pi(a\sigma) = h(a)\, \pi(\sigma) , \tag{88} $$

where

$$ h(a) \equiv \frac{k(a)}{a} . $$

Equation (88) determines the form of the consistency factor for dispersion and scale
parameters to be

$$ \pi(\sigma) = \sigma^{-r} \tag{89} $$

and

$$ \pi(\tau) = \tau^{-r} , \tag{90} $$

where the value of the constant $r$,

$$ r \equiv -h'(1) , $$

is yet to be determined in Section 10.

We stressed in Sectional (c.f. equation that an assignment of a pdf to a scale 
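parameter $\tau$ (or, equivalently, to a dispersion parameter $\sigma$ of a symmetric distribution)
can be reduced to an assignment of a pdf, $f(\nu|z\lambda_0 I)$, to a location parameter, $\nu$:

$$ f(\nu|z\lambda_0 I) = \frac{\tilde{\pi}(\nu)}{\tilde{\eta}(z)}\, f(z|\nu\lambda_0 I) $$

with

$$ z \equiv \ln t \quad \text{and} \quad \nu \equiv \ln\tau , $$

and with $\lambda_0$ being a fixed dispersion parameter. According to the findings of the previous
section (see eq. (80)), we can immediately write the appropriate consistency factor:

$$ \tilde{\pi}(\nu) = \exp\{-q\nu\} . $$

Making use of eq. (17), the pdf for $\nu$ can be transformed into the pdf for $\tau$:

$$ f(\tau|t\mu_0 I) = f(\nu|z\lambda_0 I)\, \Big|\frac{d\nu}{d\tau}\Big| = \frac{\tau^{-(q+1)}}{\tilde{\eta}(z)}\, f(z|\nu\lambda_0 I) , \tag{91} $$

where

$$ \frac{d\nu}{d\tau} = \frac{1}{\tau} . $$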

On the other hand, in order to make a consistent inference, the pdf for $\tau$, $f(\tau|t\mu_0 I)$, based
on $t$ only, must be proportional to the likelihood density $f(t|\tau\mu_0 I)$ (see eq. (64)):

$$ f(\tau|t\mu_0 I) = \frac{\pi(\tau)}{\eta(t)}\, f(t|\tau\mu_0 I) . \tag{92} $$

Then, due to Desideratum III.a, the two pdf's, (91) and (92), must be equal, which implies
the form of the consistency factors for scale parameters,

$$ \pi(\tau) = \tau^{-(q+1)} , \tag{93} $$

as well as for dispersion parameters,

$$ \pi(\sigma) = \sigma^{-(q+1)} . \tag{94} $$

The same Desideratum implies equality of the factors (89) and (94), as well as equality of
the factors (90) and (93), i.e. implies the relation

$$ r = q + 1 \tag{95} $$

between the parameters $q$ and $r$ of the consistency factors of the location and scale
parameters. Evidently, if $q$ is determined to be zero, this would immediately imply $r = 1$.
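
The relation (95) is just the statement that the change of variable $\nu = \ln\tau$ preserves the
weight assigned to any interval. A minimal numerical sketch follows; the value of `q`, the
interval bounds and the simple midpoint integrator are assumptions made for illustration only.

```python
import math

def midpoint_integral(f, lo, hi, n=20000):
    """Plain midpoint-rule quadrature of f on (lo, hi)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

q = 0.3                      # hypothetical value of the constant q
tau_lo, tau_hi = 0.5, 4.0    # hypothetical interval for the scale parameter

# Weight of the interval computed with the location-type factor exp(-q nu), nu = ln(tau).
mass_nu = midpoint_integral(lambda nu: math.exp(-q * nu), math.log(tau_lo), math.log(tau_hi))
# The same weight computed with the scale-type factor tau^-(q+1) of eq. (93).
mass_tau = midpoint_integral(lambda tau: tau ** (-(q + 1.0)), tau_lo, tau_hi)
# Closed form of both integrals, for comparison.
mass_exact = (tau_lo ** (-q) - tau_hi ** (-q)) / q
```

The two quadratures agree with each other and with the closed form, as the Jacobian
$|d\nu/d\tau| = 1/\tau$ requires.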

In the limit of complete prior ignorance about its value, the pdf for $\sigma$ given a measured
value $x_1$ and the fixed value of $\mu = \mu_0$ therefore reads:

$$ f(\sigma|\mu_0 x_1 I) = \frac{\sigma^{-r}}{\eta(x_1)}\, f(x_1|\mu_0\sigma I) . $$

Following the steps of the example of the preceding section, we update the pdf for $\sigma$ by
including result $x_2$ of an additional measurement in our inference. The updated pdf reads:

$$ f(\sigma|\mu_0 x_1 x_2 I) = \frac{\sigma^{-r}}{\eta(x_1,x_2)}\, f(x_1 x_2|\mu_0\sigma I) = \frac{\sigma^{-r}}{\eta(x_1,x_2)}\, \frac{1}{\sigma^2}\, \phi\Big(\frac{x_1-\mu_0}{\sigma}\Big)\, \phi\Big(\frac{x_2-\mu_0}{\sigma}\Big) , $$

with the normalization constant $\eta(x_1,x_2)$,

$$ \eta(x_1,x_2) = \int_0^{\infty} (\sigma')^{-r}\, f(x_1 x_2|\mu_0\sigma' I)\, d\sigma' . \tag{96} $$

In analogy with (86) and (87), both the updated pdf for $\sigma$ and the corresponding
normalization constant can be expressed in terms of $\bar{x}$ and $s$ (85), instead of $x_1$ and $x_2$:

$$ f(\sigma|\mu_0\bar{x}sI) = \frac{\sigma^{-r}}{\eta(\bar{x},s)}\, f(\bar{x}s|\mu_0\sigma I) = \frac{\sigma^{-r}}{\eta(\bar{x},s)}\, \frac{4}{\sigma^2}\, \phi\Big(\frac{\bar{x}-\mu_0}{\sigma} + \frac{s}{\sigma}\Big)\, \phi\Big(\frac{\bar{x}-\mu_0}{\sigma} - \frac{s}{\sigma}\Big) \tag{97} $$

and

$$ \eta(\bar{x},s) = \int_0^{\infty} (\sigma')^{-r}\, f(\bar{x}s|\mu_0\sigma' I)\, d\sigma' . \tag{98} $$

10 Simultaneous inference about a location and a dispersion parameter

By fixing neither the location nor the dispersion parameter, an inference about the two
parameters is invariant under a simultaneous location and scale transformation (43). The
symmetry of the problem implies the following form of the functional equation (74) for
the appropriate consistency and normalization factors:

$$ \pi(a\mu + b, a\sigma) = h(a,b)\, \pi(\mu,\sigma) , \tag{99} $$

where

$$ h(a,b) \equiv k(a,b)\, \Big|\frac{\partial(a\mu + b, a\sigma)}{\partial(\mu,\sigma)}\Big|^{-1} = \frac{k(a,b)}{a^2} . $$

In order to solve it, we differentiate equation (99) separately with respect to $a$ and $b$,
set afterward $a = 1$ and $b = 0$, and obtain:

$$ \mu\, \pi_1(\mu,\sigma) + \sigma\, \pi_2(\mu,\sigma) = -\tilde{r}\, \pi(\mu,\sigma) , \tag{100} $$

$$ \pi_1(\mu,\sigma) = -q\, \pi(\mu,\sigma) , \tag{101} $$

where the subscripts 1 and 2 denote partial derivatives with respect to the first and the
second argument, and with the constants $q$ and $\tilde{r}$ being defined as

$$ q \equiv -h_2(1,0) \quad \text{and} \quad \tilde{r} \equiv -h_1(1,0) . $$

The general solution of differential functional equation (101) is a function $\pi(\mu,\sigma)$ of
the form

$$ \pi(\mu,\sigma) = \Omega(\sigma)\, \exp\{-q\mu\} , \tag{102} $$

where $\Omega(\sigma)$ is a non-negative function of $\sigma$. When inserted in (100), (102) yields:

$$ \sigma\, \Omega'(\sigma) = (q\mu - \tilde{r})\, \Omega(\sigma) , $$

which, if it is to be true for all $\mu$ and $\sigma$, further implies $q = 0$ and

$$ \Omega(\sigma) \propto \sigma^{-\tilde{r}} , $$

so that the general form of the consistency factor $\pi(\mu,\sigma)$ reads:

$$ \pi(\mu,\sigma) = \sigma^{-\tilde{r}} , \tag{103} $$

where we put all possible multiplication constants in the normalization factor:

$$ \eta(\bar{x},s) = \int_{-\infty}^{\infty} d\mu' \int_0^{\infty} d\sigma'\, (\sigma')^{-\tilde{r}}\, f(\bar{x}s|\mu'\sigma' I) . \tag{104} $$

Now there is only one step dividing us from a complete determination of consistency
factors for location, scale and dispersion parameters. Having established the form (103)
of the consistency factor $\pi(\mu,\sigma)$, we can write down a pdf for $\mu$ and $\sigma$ given $\bar{x}$ and $s$:

$$ f(\mu\sigma|\bar{x}sI) = \frac{\pi(\mu,\sigma)}{\eta(\bar{x},s)}\, f(\bar{x}s|\mu\sigma I) = \frac{\sigma^{-\tilde{r}}}{\eta(\bar{x},s)}\, f(\bar{x}s|\mu\sigma I) . \tag{105} $$

According to the product rule the pdf can also be written as

$$ f(\mu\sigma|\bar{x}sI) = f(\mu|\sigma\bar{x}sI)\, f(\sigma|\bar{x}sI) , \tag{106} $$

where $f(\sigma|\bar{x}sI)$ is a marginal pdf:

$$ f(\sigma|\bar{x}sI) = \int_{-\infty}^{\infty} f(\mu'\sigma|\bar{x}sI)\, d\mu' = \frac{\sigma^{-\tilde{r}}}{\eta(\bar{x},s)} \int_{-\infty}^{\infty} f(\bar{x}s|\mu'\sigma I)\, d\mu' . \tag{107} $$

Then, combination of equations (86-87) and (105-107) leads to:

$$ \frac{e^{-q\mu}}{\int_{-\infty}^{\infty} e^{-q\mu'}\, f(\bar{x}s|\mu'\sigma I)\, d\mu'} = \frac{1}{\int_{-\infty}^{\infty} f(\bar{x}s|\mu'\sigma I)\, d\mu'} , $$

solvable for any value of $\mu$ if and only if

$$ q = 0 , \tag{108} $$

where $q$ is the constant of the consistency factor for location parameters (80). Due to
the simple relation (95) between the constant $q$ of the consistency factor for the location
parameters and the constant $r$ of the factors for the scale and dispersion parameters, the
above solution also implies

$$ r = 1 . \tag{109} $$

These are highly nontrivial results since they uniquely determine the consistency factors
for the location, dispersion and scale parameters (recall eqns. (80), (89) and (90)):

$$ \pi(\mu) = 1 , \qquad \pi(\sigma) = \sigma^{-1} , \qquad \pi(\tau) = \tau^{-1} . \tag{110} $$

In addition, in order to determine also the value of the constant $\tilde{r}$ of the consistency
factor (103), we recall that apart from (106), the product rule also allows for the pdf
(105) to be written as:

$$ f(\mu\sigma|\bar{x}sI) = f(\sigma|\mu\bar{x}sI)\, f(\mu|\bar{x}sI) , \tag{111} $$

where the marginal distribution $f(\mu|\bar{x}sI)$ now stands for

$$ f(\mu|\bar{x}sI) = \int_0^{\infty} f(\mu\sigma'|\bar{x}sI)\, d\sigma' = \frac{1}{\eta(\bar{x},s)} \int_0^{\infty} (\sigma')^{-\tilde{r}}\, f(\bar{x}s|\mu\sigma' I)\, d\sigma' , \tag{112} $$

while the pdf for $\sigma$ given $\mu$, $f(\sigma|\mu\bar{x}sI)$, equals the pdf (97):

$$ f(\sigma|\mu\bar{x}sI) = \frac{\sigma^{-r}\, f(\bar{x}s|\mu\sigma I)}{\int_0^{\infty} (\sigma')^{-r}\, f(\bar{x}s|\mu\sigma' I)\, d\sigma'} . \tag{113} $$

Equations (105) and (111-113) combined yield:

$$ \frac{\int_0^{\infty} (\sigma')^{-\tilde{r}}\, f(\bar{x}s|\mu\sigma' I)\, d\sigma'}{\int_0^{\infty} (\sigma')^{-r}\, f(\bar{x}s|\mu\sigma' I)\, d\sigma'} = \sigma^{\,r-\tilde{r}} , $$

with $r$ being determined (109) to be 1. Evidently, the solution of the above equation reads:

$$ \tilde{r} = r = 1 , $$

which finally determines the consistency factor $\pi(\mu,\sigma)$ for a symmetric sampling
distribution (see eq. (103)):

$$ \pi(\mu,\sigma) = \sigma^{-1} . \tag{114} $$

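Now we would like to make use of results obtained in this and preceding sections and
address the so-called problem of two means (i.e. two location parameters), also referred to
as the Fisher-Behrens problem (see refs. [24] and [25], and § 19.47, pp. 160-162, § 19.48,
p. 164 and § 26.28-26.29, pp. 441-442 in ref. [20]). Imagine $x$ and $y$ being independent
quantities, both being subject to Gaussian sampling distributions,

$$ f(x|\mu_1\sigma_1 I) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\Big\{-\frac{(x-\mu_1)^2}{2\sigma_1^2}\Big\} $$

and

$$ f(y|\mu_2\sigma_2 I) = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\Big\{-\frac{(y-\mu_2)^2}{2\sigma_2^2}\Big\} , $$

with parameters of both distributions being unknown and unconstrained. After collecting
two events, $(x_1,x_2)$ and $(y_1,y_2)$, from each of the two samples, we are in a position to
make a probabilistic inference about the unknown parameters. The pdf's for $(\mu_1,\sigma_1)$ and
$(\mu_2,\sigma_2)$ read:

$$ f(\mu_1\sigma_1|\bar{x}s_x I) = \frac{\pi(\mu_1,\sigma_1)}{\eta(\bar{x},s_x)}\, f(\bar{x}s_x|\mu_1\sigma_1 I) = \frac{4 s_x}{2\pi\,\sigma_1^3} \exp\Big\{-\frac{s_x^2}{\sigma_1^2}\Big[\frac{(\bar{x}-\mu_1)^2}{s_x^2} + 1\Big]\Big\} \tag{115} $$

and

$$ f(\mu_2\sigma_2|\bar{y}s_y I) = \frac{\pi(\mu_2,\sigma_2)}{\eta(\bar{y},s_y)}\, f(\bar{y}s_y|\mu_2\sigma_2 I) = \frac{4 s_y}{2\pi\,\sigma_2^3} \exp\Big\{-\frac{s_y^2}{\sigma_2^2}\Big[\frac{(\bar{y}-\mu_2)^2}{s_y^2} + 1\Big]\Big\} , \tag{116} $$

where

$$ \bar{x} \equiv \frac{x_1 + x_2}{2} , \qquad s_x \equiv \frac{|x_1 - x_2|}{2} $$

and

$$ \bar{y} \equiv \frac{y_1 + y_2}{2} , \qquad s_y \equiv \frac{|y_1 - y_2|}{2} . $$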

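Since the two pdf's are independent, we apply the product rule and write down a
pdf for $\mu_1$, $\mu_2$, $\sigma_1$ and $\sigma_2$ as a product of pdf's (115) and (116):

$$ \begin{aligned}
f(\mu_1\mu_2\sigma_1\sigma_2|\bar{x}\bar{y}s_x s_y I) &= f(\mu_1\sigma_1|\bar{x}s_x I)\, f(\mu_2\sigma_2|\bar{y}s_y I) \\
&= \frac{4 s_x s_y}{\pi^2\, \sigma_1^3 \sigma_2^3} \exp\Big\{-\frac{s_x^2}{\sigma_1^2}\Big[\frac{(\bar{x}-\mu_1)^2}{s_x^2} + 1\Big]\Big\} \exp\Big\{-\frac{s_y^2}{\sigma_2^2}\Big[\frac{(\bar{y}-\mu_2)^2}{s_y^2} + 1\Big]\Big\} .
\end{aligned} $$

By integrating out parameters $\sigma_1$ and $\sigma_2$ we find the marginal pdf for $\mu_1$ and $\mu_2$ to be of
the form:

$$ \begin{aligned}
f(\mu_1\mu_2|\bar{x}\bar{y}s_x s_y I) &= \int_0^{\infty} d\sigma_1 \int_0^{\infty} d\sigma_2\, f(\mu_1\mu_2\sigma_1\sigma_2|\bar{x}\bar{y}s_x s_y I) \\
&= \frac{1}{\pi^2 s_x s_y} \Big[\frac{(\bar{x}-\mu_1)^2}{s_x^2} + 1\Big]^{-1} \Big[\frac{(\bar{y}-\mu_2)^2}{s_y^2} + 1\Big]^{-1} .
\end{aligned} $$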
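
The integration over the dispersion parameter can be checked numerically for one of the two
samples. The sketch below integrates the pdf (115) over $\sigma$ and compares the result with
the closed form appearing in the marginal pdf above. It is an illustration only: the sample
values, the quadrature settings and the substitution $u = 1/\sigma$ used to tame the infinite
range are assumptions of this sketch.

```python
import math

def joint_pdf(mu, sigma, xbar, s):
    """Eq. (115): (4 s / (2 pi sigma^3)) exp{-[(xbar - mu)^2 + s^2] / sigma^2}."""
    return (2.0 * s / math.pi) * sigma ** -3 * math.exp(-((xbar - mu) ** 2 + s * s) / sigma ** 2)

def marginal_numeric(mu, xbar, s, n=20000, u_max=12.0):
    """Integrate joint_pdf over sigma in (0, inf) via u = 1/sigma, d(sigma) = du / u^2."""
    h = u_max / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        total += joint_pdf(mu, 1.0 / u, xbar, s) / (u * u)
    return total * h

def marginal_closed(mu, xbar, s):
    """One factor of the marginal pdf: 1 / (pi s [ (xbar - mu)^2 / s^2 + 1 ])."""
    return 1.0 / (math.pi * s * (((xbar - mu) / s) ** 2 + 1.0))

xbar, s = 1.0, 0.8   # hypothetical sample mean and half-range
errors = [abs(marginal_numeric(mu, xbar, s) - marginal_closed(mu, xbar, s))
          for mu in (0.0, 0.5, 1.0, 2.5)]
```

Each factor of the marginal pdf is thus a Cauchy-type density in $\mu$, centred on the sample
mean with width given by $s$.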



Note that the above results were all obtained simply by using some of the applications of
basic Desiderata without making any additional assumptions or requiring any new postulates
(compare to references [26] and [20], § 26.29, p. 442). We refer to the Fisher-Behrens
problem again in a later section, when we comment on difficulties with inferences about
parameters outside the framework of probability.

11 On uniqueness of consistency factors and on consistency of basic rules

In previous sections we saw that the general rules for plausible reasoning - the Cox-
Polya-Jaynes Desiderata - uniquely determine the consistency factors for location, scale
and dispersion parameters. In other words, in the limit of complete prior ignorance, there
is only one possible way of making a consistent inference about the three types of
parameters. According to (63), the pdf, assigned to one or to several of these parameters
simultaneously, is to be proportional to the likelihood function, containing the information
from one or several measured events $x_i$ that are subject to the distribution determined
by the inferred parameter(s), and to the appropriate consistency factor as determined at
the end of the preceding section.

Anyone who possesses the same information, but assigns a different probability
distribution to a given parameter, e.g. by choosing a 'consistency factor' of a different form,
thus necessarily violates at least one of our basic Desiderata. It is certainly true that nobody
has the authority to forbid such violations, but, at the same time, it is also true that
anyone coming to conclusions by violating such basic and general rules would
surely have difficulties in persuading anyone else, who was aware of these violations, to
accept his or her conclusions.

In Section 7 we stressed that a prior information $I$ in a problem of inference about a
parameter $\theta$ of a sampling distribution, $f(x|\theta I)$, is equal to corresponding information
$I'$ about an inferred parameter $\nu = \bar{g}_a(\theta)$ of the sampling distribution for $y = g_a(x)$,
$f(g_a(x)|\bar{g}_a(\theta)I')$, only if three requirements are simultaneously met: a) if the sampling
distribution is invariant under transformations $g_a$ and $\bar{g}_a$, b) if the permissible range of
sampling variate $x$, $(x_a,x_b)$, is invariant under $g_a$, and c) if the permissible range of the
inferred parameter $\theta$, $(\theta_a,\theta_b)$, is invariant under transformation $\bar{g}_a$. Evidently, for the ranges
of parameters $(\mu_a,\mu_b) = (-\infty,\infty)$ and $(\sigma_a,\sigma_b) = (\tau_a,\tau_b) = (0,\infty)$, and the corresponding
transformations (43-45), the condition c) is well fulfilled for all $a \in (0,\infty)$ and
$b \in (-\infty,\infty)$.

But, in practice, we usually face problems with pre-constrained inferred parameters:
we possess some additional information that narrows the admissible range. As a simple
example, when we are estimating the average lifetime $\tau$ of a newly discovered particle,
produced in an experiment with highly energetic protons from an accelerator hitting a
fixed target, it is easy to imagine that $\tau$ cannot really be infinite, for in that case there
should be many of these particles around as remnants of the Big Bang. In fact, it might
well be reasoned that $\tau$ is necessarily even much smaller than billions of years, since
in case of $\tau$ being sufficiently large, e.g. of the order of a millisecond, we should have
noticed the particles as products of cosmic protons hitting nuclei in the upper layers of
the Earth's atmosphere. The listed two arguments, as well as any possible additional
ones, thus lead to a finite interval $(\tau_a,\tau_b)$ in the positive half of the real axis. But such
an interval is clearly not invariant under parameter transformation $\bar{g}_a(\tau) = a\tau$ and so the
above condition c) for the equality of information, $I = I'$, is not fulfilled. How shall we
proceed in such cases?

Suppose for a moment that we ignore the fact that the condition c) is not fulfilled.
Effectively this is equivalent to a prescription that can often be found in textbooks on the
so-called Bayesian inference: to use the same function $\pi(\tau)$ as in the case of no constraints
on the parameter range, and afterwards to chop off the unconstrained pdf for the inferred
parameter, $\tau$, outside the interval $(\tau_a,\tau_b)$ and renormalize the truncated distribution (see,
for example, [16], § 3.17, pp. 72-73). It is easy to show that in general such an ad hoc
prescription inevitably leads to inconsistencies.

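Namely, without the constraints on $\tau$, $\tilde{\pi}(a\tau)$ in equation (70) is to equal $\pi(a\tau)$, and
consequently $\pi(\tau)$ is to equal $1/\tau$. Then, (70) implies

$$ k(a) = 1 , \quad \forall\, a \in (0,\infty) , $$

so the functional equation (75) for the normalization factor $\eta(t_1)$ reads:

$$ \tilde{\eta}(a t_1) = \frac{\eta(t_1)}{a} . \tag{117} $$

By setting $a = 1$ we realize that $\tilde{\eta}(t)$ and $\eta(t)$ are to be the same functions,

$$ \tilde{\eta}(t_1) = \eta(t_1) , \tag{118} $$

or, equivalently,

$$ \tilde{\eta}(a t_1) = \eta(a t_1) . \tag{119} $$

When inserted in (117), (119) yields

$$ \eta(a t_1) = \frac{\eta(t_1)}{a} , \tag{120} $$

which must be true for any $a \in (0,\infty)$. For $a = 1/t_1$, (120) thus reads:

$$ \eta(t_1) = \frac{\eta(1)}{t_1} . \tag{121} $$

On the other hand, by definition (71), $\tilde{\eta}(t_1)$ should ensure normalization of the pdf for $a\tau$:

$$ \tilde{\eta}(t_1) = \int_{\tau_a}^{\tau_b} \pi(\tau')\, f(t_1|\tau' I)\, d\tau' = \int_{\tau_a}^{\tau_b} (\tau')^{-2}\, e^{-t_1/\tau'}\, d\tau' = \frac{1}{t_1} \Big(e^{-t_1/\tau_b} - e^{-t_1/\tau_a}\Big) . \tag{122} $$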
Evidently, the equality (118) is assured for all $t_1 \in (0,\infty)$ only if

$$ \tau_a = 0 \quad \text{and} \quad \tau_b = \infty , $$

i.e. only if the invariance of the parameter range under the particular group $\bar{\mathcal{G}}$ of
transformations $\bar{g}_a$ is exact.

Figure 1: a) Unconstrained probability density function for $\tau$ based on a recorded decay
time $t_1 = 5$ ps. The hatched areas represent integrals, $\epsilon_1 = 0.189$ and $\epsilon_2 = 0.465$, of
the pdf in the intervals $(0, \tau_a = 3\ \mathrm{ps})$ and $(\tau_b = 8\ \mathrm{ps}, \infty)$, respectively. b) Pdf for $\tau$
based on an average $\bar{t} = 5$ ps of $n = 10$ recorded decay times with the integrals of
the unconstrained pdf outside the allowed region for $\tau$ being reduced to $\epsilon_1 = 0.031$ and
$\epsilon_2 = 0.102$. (Horizontal axes of both panels: $\tau$ [ps].)
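
The size of the mismatch between the two normalization factors is easy to evaluate. The sketch
below compares (122) with (121) for the values quoted in the caption of Figure 1.a
($t_1 = 5$ ps, $\tau_a = 3$ ps, $\tau_b = 8$ ps); taking $\eta(1) = 1$ in (121) is an assumption
of this illustration, made so that the two factors coincide in the unconstrained limit.

```python
import math

def eta_truncated(t1, tau_a, tau_b):
    """Eq. (122): integral of tau^-2 exp(-t1/tau) over (tau_a, tau_b)."""
    lower = math.exp(-t1 / tau_a) if tau_a > 0.0 else 0.0
    upper = math.exp(-t1 / tau_b) if math.isfinite(tau_b) else 1.0
    return (upper - lower) / t1

def eta_invariant(t1):
    """Eq. (121) with eta(1) = 1, i.e. the factor for tau in (0, inf)."""
    return 1.0 / t1

t1 = 5.0  # ps
ratio_constrained = eta_truncated(t1, 3.0, 8.0) / eta_invariant(t1)
ratio_unconstrained = eta_truncated(t1, 0.0, math.inf) / eta_invariant(t1)
```

For the constrained range the ratio is about 0.35 rather than 1, so the two factors, and any
probabilities computed with them, differ substantially.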

In practice, our reasoning would still be sufficiently consistent if

$$ \frac{t_1}{\tau_a} \gg 1 \quad \text{and} \quad \frac{t_1}{\tau_b} \ll 1 , \tag{123} $$

i.e. if the integrals $\epsilon_1$ and $\epsilon_2$ of the unconstrained pdf for $\tau$ outside the allowed region (see
Figure 1),

$$ \epsilon_1 \equiv \int_0^{\tau_a} f(\tau|t_1 I)\, d\tau \quad \text{and} \quad \epsilon_2 \equiv \int_{\tau_b}^{\infty} f(\tau|t_1 I)\, d\tau , \tag{124} $$

are sufficiently small. That is, if the conditions (123) are fulfilled, normalization factors
(121) and (122) are equal beyond the precision required for the particular inference. As
admitted above, the invariance of the admissible parameter range, and consequently also
the consistency of our reasoning, is not exact in such a case since for extremely large or
extremely small values of measured decay times $t$ the equality (117) would not hold. But
once $t_1$ has been recorded, it is very likely that the value of the inferred $\tau$ would also be of
the same order of magnitude, i.e. it would be extremely unlikely for $\tau$ to be much smaller
or much larger than $t_1$. Then, with $\tau \sim t_1$, it is also very unlikely that we would ever
observe an event $t$ orders of magnitude different from $t_1$.

On the other hand, if the conditions (123) are not fulfilled (see, for example, Figure 1.a),
our reasoning is clearly not consistent: by applying the two equally valid normalization
factors, (121) and (122), we are able to arrive at significantly different probabilities
for the same proposition, e.g. $P(a\tau \in (\tau_1,\tau_2)\,|\,a t_1 I)$, which is in direct contradiction with
the consistency Desideratum III.a. The additional information that narrows the interval
$(\tau_a,\tau_b)$ may still be very useful, but it is just that consistent probabilistic reasoning is
impossible in such a situation.

In order to avoid such an inconsistency, we could have ignored the information
additional to $t_1$, and stretched $(\tau_a,\tau_b)$ to the whole positive half of the real axis. But this is
not a solution either, since in this way we would have been arbitrarily ignoring some of
the available information and basing our conclusions on what remains. Acting in explicit
contradiction with Desideratum III.b, we would have allowed ideology to break into our
inference, which is inadmissible for any scientifically respectable reasoning.

Thus, the only consistent solution to our problem of inference about the pre-constrained
parameter $\tau$ would be provided by recording additional independent decay times,
$t_2, t_3, \ldots, t_n$, of particles of the same type. Then, the unconstrained probability distribution
for $\tau$, based on the recorded data, is described by the following pdf:

$$ f(\tau|t_1 t_2 \ldots t_n I) = f(\tau|\bar{t} n I) = \frac{1}{(n-1)!}\, \frac{1}{\tau} \Big(\frac{n\bar{t}}{\tau}\Big)^{n} e^{-n\bar{t}/\tau} , $$

where $\bar{t}$ is an average of the recorded decay times:

$$ \bar{t} \equiv \frac{1}{n} \sum_{i=1}^{n} t_i . $$

The distribution narrows as $n$ increases (see Figure 1.b). Therefore, by collecting enough
data, the inconsistency is diminished beyond the required level: by diminishing the
integrals $\epsilon_1$ and $\epsilon_2$ of the unconstrained pdf for $\tau$ outside the constrained domain, the limits
$\tau_a$ and $\tau_b$ that caused the inconsistencies become irrelevant.
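
How quickly the integrals $\epsilon_1$ and $\epsilon_2$ shrink can be computed directly. For integer $n$ the
cumulative distribution of the pdf above is the regularized upper incomplete gamma function
$Q(n, n\bar{t}/\tau)$, which reduces to a finite sum. The sketch below uses that sum to evaluate
both integrals for the two panels of Figure 1; the function names are illustrative only.

```python
import math

def posterior_cdf(tau, t_mean, n):
    """P(tau' < tau | t_mean, n) for the pdf (n t_mean / tau)^n exp(-n t_mean / tau) / (tau (n-1)!).

    For integer n this equals Q(n, x) = exp(-x) * sum_{k=0}^{n-1} x^k / k!, x = n t_mean / tau.
    """
    x = n * t_mean / tau
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k
        total += term
    return math.exp(-x) * total

def tail_integrals(t_mean, n, tau_a, tau_b):
    """Integrals of the unconstrained pdf below tau_a and above tau_b, cf. eq. (124)."""
    eps1 = posterior_cdf(tau_a, t_mean, n)
    eps2 = 1.0 - posterior_cdf(tau_b, t_mean, n)
    return eps1, eps2

eps_single = tail_integrals(5.0, 1, 3.0, 8.0)    # Figure 1.a: one decay time of 5 ps
eps_ten = tail_integrals(5.0, 10, 3.0, 8.0)      # Figure 1.b: average of 5 ps over n = 10
```

The values obtained this way can be compared with those quoted in the caption of Figure 1.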

Our theory of consistent inference about parameters is therefore valid only if certain
conditions are fulfilled, i.e. in the limit $\epsilon_{1,2} \to 0$. Such a theory is referred to as an effective
theory, with the term coming from physics where all theories are effective. The precision
of predictions of an effective theory is estimated by the proximity of the actual conditions
to the ideal limit. The values of integrals $\epsilon_{1,2}$ (124), compared to zero, can thus serve
as an estimate of the precision of our probabilistic inference about the pre-constrained
parameters.

12 Calibration 

The most striking achievement of the physical sciences is prediction. 

Georg Polya ([4], Chapter XIV, § 4, p. 64)

Thus far our theory of plausible inference about parameters has been developed by
following Desiderata I.-III. only, while the implications of the operational Desideratum
have not yet been considered. According to the latter Desideratum, in order to exceed the
level of a mere speculation, our theory of inference about parameters must be exposed,
i.e. must be able to make predictions that can be verified by experiments. The magnitudes
considered by physicists such as mass, electric charge or reaction velocity have an
operational definition: the physicist knows very well which operations he or she has to
perform if he or she wishes to ascertain the magnitude of an electric charge, for example
([4], Chapter XV, § 4, p. 117). Is there a way for probability as a measure of a personal
degree of belief to become operational, too?

Imagine we were given several numbers $x_i$, all produced by a random number generator
according to a distribution $f(x_i|\theta_i I)$. While the form of the distribution is known
to us, we do not know much about the values $\theta_i$ of the parameter: in general they can be
different for each of the numbers $x_i$ generated and can be any number in the range $(\theta_a,\theta_b)$.

In fact, what we are asked for is to make probabilistic inferences about the unknown
values $\theta_i$ on the basis of each datum $x_i$ separately. Based on our probability distribution
for $\theta_i$, $p(\theta_i|x_i I) = f(\theta_i|x_i I)$, we should specify our confidence intervals $(\theta_{i,1},\theta_{i,2})$ such
that

$$ P(\theta_i \in (\theta_{i,1},\theta_{i,2})\,|\,x_i I) = \int_{\theta_{i,1}}^{\theta_{i,2}} f(\theta_i|x_i I)\, d\theta_i = \delta , \tag{125} $$

with $\delta$ being the same for all inferences. Note that the interval for the inference of a
particular value $\theta_i$ is not unique: it can be the shortest of all possible intervals, the central
interval with $P(\theta_i < \theta_{i,1}|x_i I) = P(\theta_i > \theta_{i,2}|x_i I) = (1-\delta)/2$, the lower-most interval
with $\theta_{i,1} = \theta_a$, the upper-most interval with $\theta_{i,2} = \theta_b$, or any other interval as long as the
probability (125) equals $\delta$.

After the inference has been made, we learn the values $\theta_i$ of the parameter used in the
random number generator. Then, our probability judgments are said to be calibrated if
they agree with the actual frequencies of occurrence ([16], § 6.4, p. 142), i.e. if the fraction
of inferences with the specified intervals containing the actual value of the parameter for
the particular example, coincides with $\delta$.
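
The experiment just described can be imitated by a short simulation. The sketch below is an
illustration under stated assumptions, not a prescription from the text: it borrows the
exponential decay-time model of Section 11 with the consistency factor $1/\tau$, for which the
pdf for $\tau$ given one decay time $t$ is $t\,\tau^{-2} e^{-t/\tau}$ with cumulative distribution
$e^{-t/\tau}$. The spread of the true values, the seed and the number of trials are arbitrary
choices.

```python
import math
import random

def central_interval(t, delta):
    """Central interval (125) for tau, using the cdf P(tau' < tau | t I) = exp(-t / tau)."""
    lo = -t / math.log((1.0 - delta) / 2.0)
    hi = -t / math.log((1.0 + delta) / 2.0)
    return lo, hi

def coverage(delta, trials, seed=12345):
    """Fraction of simulated inferences whose interval contains the true tau."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        tau_true = math.exp(rng.uniform(-3.0, 3.0))   # arbitrary, different in each trial
        t = rng.expovariate(1.0 / tau_true)           # one decay time with mean tau_true
        lo, hi = central_interval(t, delta)
        hits += lo < tau_true < hi
    return hits / trials

observed = coverage(0.9, 200000)
```

In this setting the observed fraction is close to the chosen $\delta$ irrespective of how the
true values are distributed, which is what calibration demands.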

The definition of long range relative frequency, although in some way less distinct than
that of an electric charge, is still operational: it suggests definite operations that we can
undertake to obtain an approximate numerical value of such a frequency ([4], Chapter XV,
§ 4, p. 117). Our theory thus cannot be expected to give a correct prediction each time, but
it can be verifiable from its long range consequences, i.e. it can reasonably be expected
to give the right answer in an assignable percentage of cases in the long run.

Under what conditions is a calibrated inference achieved? To answer this question we
refer to the construction of classical confidence intervals (see [27], § 9.2.1, pp. 200-201 in
©1, and Chapter 19 in [20]) and to fiducial theory (see refs. E) and [20], § 19.44-19.47,
pp. 156-162, § 26.26-26.29, pp. 440-442). Let $\alpha$ and $\alpha + \delta$ be the probabilities for $x$ to take
values less than or equal to $x_1$ and $x_2$, respectively, given the value $\theta$ of the parameter of the
(continuous) sampling distribution for $x$:

$$ \alpha = P(x \le x_1(\theta)\,|\,\theta I) = F(x_1(\theta), \theta) , \qquad \alpha + \delta = P(x \le x_2(\theta)\,|\,\theta I) = F(x_2(\theta), \theta) . \tag{126} $$
Then, the probability for $x$ to take a value in the interval $(x_1(\theta), x_2(\theta))$ equals $\delta$:

$$ P(x_1(\theta) < x < x_2(\theta)) = F(x_2(\theta), \theta) - F(x_1(\theta), \theta) = \delta , \tag{127} $$

regardless of the value of $\alpha$. Let unique values $x_1(\theta)$ and $x_2(\theta)$, satisfying (126), exist
for every $\theta$ in the range $(\theta_a, \theta_b)$, and for every $\alpha \in (0, 1 - \delta)$. It is easy to see that
unconstrained location and scale parameters with $\mu \in (-\infty, \infty)$ and $\tau \in (0, \infty)$ meet
such a requirement. The curves $x_1(\theta; \alpha)$ and $x_2(\theta; \alpha + \delta)$ are formed by varying $\theta$ in
$x_1(\theta)$ and $x_2(\theta)$ but at fixed $\alpha$ and $\delta$ (see Figure 2), and the region between the two curves
is known as a confidence belt.

Having the confidence belts defined, we return to the estimation of unknown values $\theta_i$
of the parameter. Given the measured value $x_i$, the required confidence interval $(\theta_{i,1}, \theta_{i,2})$
is obtained by the intersection of the vertical line $x = x_i$ with the curves $x_1(\theta; \alpha)$ and
$x_2(\theta; \alpha + \delta)$. By inspecting Figure 2 it becomes evident that the proposition

$$ \theta_i \in (\theta_{i,1}, \theta_{i,2}) \tag{128} $$

is true if $x_i \in (x_1(\theta_i), x_2(\theta_i))$ and is wrong otherwise (see, for example, $x_2$ and the
corresponding confidence interval $(\theta_{2,1}, \theta_{2,2})$ in Figure 2b). This means that, according to
(127), the probability for (128) to be true is exactly $\delta$ for every $\theta_i$, i.e. in the long run our
inferences will be correct in a fraction $\delta$ of cases regardless of the distribution of the values $\theta_i$.
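The coverage property just stated can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the sampling distribution is taken to be a unit-width Gaussian with location parameter $\theta$, so that $F(x, \theta) = \Phi(x - \theta)$, and the true values $\theta_i$ are drawn from an arbitrary, deliberately non-uniform distribution. The function name `belt_coverage` and all numerical settings are hypothetical choices made for this example.

```python
import random
from statistics import NormalDist


def belt_coverage(delta=0.68, alpha=0.1, trials=20000, seed=1):
    """Fraction of intervals (theta_1, theta_2) that contain the true theta_i."""
    rng = random.Random(seed)
    std = NormalDist()  # assumed sampling cdf: F(x, theta) = Phi(x - theta)
    # Belt edges at fixed alpha and delta:
    # x_1(theta) = theta + z_lo and x_2(theta) = theta + z_hi
    z_lo = std.inv_cdf(alpha)
    z_hi = std.inv_cdf(alpha + delta)
    hits = 0
    for _ in range(trials):
        # Arbitrary (assumed) distribution of the true values theta_i
        theta = rng.expovariate(0.2) - 3.0
        x = rng.gauss(theta, 1.0)  # one datum per inference
        # Invert the belt at the measured x:
        # F(x, theta_1) = alpha + delta and F(x, theta_2) = alpha
        theta_1, theta_2 = x - z_hi, x - z_lo
        hits += theta_1 < theta < theta_2
    return hits / trials


if __name__ == "__main__":
    print(belt_coverage())  # should be close to delta
```

Changing the distribution from which `theta` is drawn should leave the observed fraction close to $\delta$, which is the content of the statement above.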
In the case there exist unique values $\theta_{1,2} \in (\theta_a, \theta_b)$ for any $\alpha \in (0, 1)$, any
$\delta \in (0, 1 - \alpha)$ and for any $x_{1,2} \in (x_a, x_b)$, that solve equations (126), the cdf for $x$ is either
strictly decreasing or strictly increasing in $\theta$. Then, by construction of confidence belts,
the relations

$$ F(x_i, \theta_{i,1}) = \alpha + \delta , \qquad F(x_i, \theta_{i,2}) = \alpha , $$

are always true for cdf's strictly decreasing in $\theta$ (Figure 2a), as is the case for functions
with $\theta$ being a location or a scale parameter (see Section 4), while for strictly increasing
distribution functions (Figure 2b), the relations read:

$$ F(x_i, \theta_{i,1}) = \alpha , \qquad F(x_i, \theta_{i,2}) = \alpha + \delta . $$

Accordingly, the fraction $\delta$ of correct inferences can be expressed as:

$$ \delta = \pm F(x_i, \theta_{i,1}) \mp F(x_i, \theta_{i,2}) = \pm F(x_i, \theta_{i,1}) \mp F(x_i, \theta_{i,1} + \Delta\theta) , \tag{129} $$

where the upper (lower) sign corresponds to cumulative distribution functions strictly
decreasing (increasing) in $\theta$, while

$$ \Delta\theta = \theta_{i,2} - \theta_{i,1} . $$

In the limit of infinitesimally small fractions $\delta$ and differences $\Delta\theta = d\theta$, equation (129)
can be rewritten in terms of a differential of the distribution function with respect to $\theta$:

$$ \delta = \mp \left( \frac{\partial}{\partial\theta} F(x_i, \theta) \right) d\theta = \mp F_2(x_i, \theta)\, d\theta . \tag{130} $$

But recall now our probabilistic inference about $\theta_i$: given the pdf $f(\theta|x_i I)$, the
probability for $\theta_i$ being in the interval $(\theta, \theta + d\theta)$ reads:

$$ P(\theta|x_i I) = f(\theta|x_i I)\, d\theta . $$

Thus, we have finally come to the point where we are able to answer: our inference will be
calibrated, i.e. the assigned probability will coincide with the fraction $\delta$ of confidence
intervals containing the true values of the parameter, if and only if

$$ f(\theta|x_i I) = \mp F_2(x_i, \theta) . \tag{131} $$

For location parameters with the cumulative distribution function (35) for the sampling
variable $x$, the condition (131) implies the calibrated pdf for $\mu$ to be:

$$ f(\mu|x_i I) = -\frac{\partial}{\partial\mu} \int_{-\infty}^{x_i - \mu} \phi(u)\, du = \phi(x_i - \mu) = f(x_i|\mu I) . $$

The above pdf for $\mu$ coincides with the one obtained by following the basic Desiderata of
consistent inference, implying the consistency factor $\pi(\mu)$ to be constant (110). Therefore,
the only consistent way of inference about location parameters is also the only one
that is calibrated.

The latter is also true for inferences about scale parameters with the appropriate
distribution function (39). Namely,

$$ f(\tau|t_i I) = -\frac{\partial}{\partial\tau} \int_{0}^{t_i/\tau} \phi(u)\, du = \frac{t_i}{\tau^2}\, \phi\!\left(\frac{t_i}{\tau}\right) = \frac{t_i}{\tau}\, f(t_i|\tau I) $$

corresponds to $\pi(\tau) = \tau^{-1}$, as already determined in (110).

The probability distributions for location, scale or dispersion parameters that were
assigned in a consistent way passed an important test: they are all calibrated. The question
can be raised whether there are any other types of parameters that are also in accordance
with the calibration requirement (131). We restrict the answer only to parameters whose
pdf can be written in the form of the Consistency Theorem (l^?l) . obtained by requiring 
logically independent pieces of information to be commutative. By combining the two 
equations we obtain: 

$$ \pi(\theta)\, F_1(x, \theta) \pm \eta(x)\, F_2(x, \theta) = 0 , \tag{132} $$

where the upper (lower) sign stands for cdf's which are strictly decreasing (increasing) in
$\theta$. By defining function $G(x, \theta)$ as a difference (sum),

$$ G(x, \theta) = h(x) \mp k(\theta) , $$

with $h(x)$ and $k(\theta)$ being related to $\pi(\theta)$ and $\eta(x)$ as

$$ h'(x) = \eta(x) \quad \text{and} \quad k'(\theta) = \pi(\theta) , $$

equation (132) can be rewritten as

$$ F_1(x, \theta)\, G_2(x, \theta) - F_2(x, \theta)\, G_1(x, \theta) = 0 , $$

with $G_1(x, \theta) = \eta(x)$ and $G_2(x, \theta) = \mp\pi(\theta)$, where $\eta(x)$ and $\pi(\theta)$ are strictly
positive functions (see Section 6). But as we saw in Section 5, the general solution of such a
differential functional equation is a distribution function $F(x, \theta)$ of the form

$$ F(x, \theta) = \Phi(z - \mu) , $$

where

$$ z = h(x) \quad \text{and} \quad \mu = \pm k(\theta) , $$

i.e. a distribution function that corresponds to $\mu$ being a location parameter (35). Therefore,
in the limit of complete prior ignorance, an inference about a parameter $\theta$ that is
subject to the calibration condition (131) is necessarily reducible to an inference about
a location parameter. Note that this result was first obtained by Dennis Lindley [28] by
combining the calibration condition (131) and Bayes' theorem with a prior pdf $f(\theta|I)$
which is independent of data $x$.

It is interesting to note that the above result is identical to the result obtained at the end
of Section 5 and discussed in Section 7, despite the fact that the requirements of logical
consistency and that of calibration may appear, at least at first sight, as almost diamet- 
rically opposed starting points. As a very important consequence, every probabilistic 
inference about a parameter of a sampling distribution that we are sure is consistent will 
thus at the same time also be calibrated and, vice versa, every calibrated inference, based 
on a posterior probability distribution that is factorized according to the Consistency 
Theorem, will simultaneously be logically consistent, too. 

Recall Section 2, where we chose a set of functions, called probabilities, out of all
possible plausibility functions, suitable for representing our degree of belief according to 
the basic Desiderata. The main reason for such a choice was that for the probabilities the 
basic equations for manipulating plausibilities - the product and the sum rule, dU) and Q- 
are of especially simple forms. But now we realize another advantage of probabilities 
over other plausibility functions: it is only for probabilities that the assigned degrees of 
belief exactly coincide with the long term relative frequencies. Other plausibilities are 
one-to-one functions of probabilities, so their values correspond to (the same) one-to-one 
functions of the relative frequencies. Again, this does not imply that predictions in terms 
of probabilities are more reliable than predictions in terms of any other set of plausibility 
functions: they are just the easiest to interpret. 

The predictions of the systems of plausible reasoning, however, that are not isomor- 
phisms of the probability system, are, apart from being inconsistent, also necessarily un- 
calibrated. This means that in general their predictions in terms of long-range frequencies 
are not correct. 

Take, as an example, the power-law distribution with the pdf

$$ f(x|\theta I) = (\theta + 1)\, x^{\theta} , \tag{133} $$

the ranges of $x$ and $\theta$,

$$ x \in (0, 1) \quad \text{and} \quad \theta \in (-1, \infty) , \tag{134} $$

and the corresponding cdf

$$ F(x, \theta) = \int_0^x (1 + \theta)\, (x')^{\theta}\, dx' = x^{\theta + 1} . $$

Due to the range of $x$, the cdf is strictly decreasing in $\theta$, therefore the calibration condition
for the inference about the parameter reads:

$$ f(\theta|x I) = -F_2(x, \theta) = -x^{\theta + 1} \ln x . \tag{135} $$

The ratio of the pdf's for $\theta$ and $x$, (133) and (135),

$$ \frac{f(\theta|x I)}{f(x|\theta I)} = \frac{-x \ln x}{1 + \theta} , $$

implies consistency and normalization factors of the form

$$ \pi(\theta) = \frac{1}{1 + \theta} $$

and

$$ \eta(x) = -\frac{1}{x \ln x} . $$

Their integrals,

$$ z = h(x) = \int \eta(x)\, dx = -\ln(-\ln x) $$

and

$$ \mu = k(\theta) = \int \pi(\theta)\, d\theta = \ln(1 + \theta) , $$

with the domains of both being the complete real axis (c.f. equation (134)),

$$ z, \mu \in (-\infty, \infty) , $$

allow for a reduction of the inference about parameter $\theta$ to an inference about a location
parameter $\mu$ of the sampling distribution of the variate $z$ with the corresponding cdf:

$$ F(z, \mu) = \Phi(-\ln(-\ln x) - \ln(1 + \theta)) = \Phi(z - \mu) . $$

Indeed:

$$ f(z|\mu I) = f(x(z)|\theta(\mu) I) \left| \frac{dx}{dz} \right| = e^{-(z - \mu)} \exp\left\{ -e^{-(z - \mu)} \right\} . $$
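The calibration of the pdf (135) can be checked numerically. The following sketch rests on two facts derived from the formulas above: integrating (135) from $-1$ to $\theta$ gives the cumulative function $1 - x^{\theta+1}$, so a quantile $q$ of $\theta$ is $\ln(1-q)/\ln x - 1$, and a datum can be sampled through $\ln x = \ln u/(\theta+1)$ with $u$ uniform on $(0,1)$. The function name `power_law_coverage` and the distribution chosen for the true values of $\theta$ are assumptions made only for this illustration.

```python
import math
import random


def power_law_coverage(delta=0.5, trials=20000, seed=2):
    """Coverage of central intervals built from f(theta|x) = -x**(theta+1) * ln(x)."""
    rng = random.Random(seed)
    q_lo, q_hi = (1.0 - delta) / 2.0, (1.0 + delta) / 2.0
    hits = 0
    for _ in range(trials):
        # Arbitrary (assumed) distribution of true values; theta + 1 in (0.2, 5)
        theta = rng.uniform(0.2, 5.0) - 1.0
        # Sample x from F(x, theta) = x**(theta + 1); work with ln(x) directly.
        # -expovariate(1.0) is distributed as ln(u) with u uniform on (0, 1).
        log_x = -rng.expovariate(1.0) / (theta + 1.0)
        # Quantiles of theta from the cumulative function 1 - x**(theta + 1)
        theta_lo = math.log(1.0 - q_lo) / log_x - 1.0
        theta_hi = math.log(1.0 - q_hi) / log_x - 1.0
        hits += theta_lo < theta < theta_hi
    return hits / trials


if __name__ == "__main__":
    print(power_law_coverage())  # should be close to delta
```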

As a counter-example, i.e. as an example with the cdf of the sampling distribution
being neither strictly increasing nor decreasing with respect to all of its parameters, we
refer to the Weibull distribution (Ol, § 5.33, pp. 189-190) with the pdf of the form:

$$ f(t|\theta\tau I) = \frac{\theta}{\tau} \left( \frac{t}{\tau} \right)^{\theta - 1} \exp\left\{ -\left( \frac{t}{\tau} \right)^{\theta} \right\} . $$

The range of the variate $t$, as well as the ranges of both parameters $\theta$ and $\tau$, coincide with
the positive half of the real axis:

$$ t, \tau, \theta \in (0, \infty) . $$

Let the value of the scale parameter $\tau$ of the distribution be known and let $x$ be a
normalized variate:

$$ x = \frac{t}{\tau} , \quad x \in (0, \infty) , $$

with the appropriate pdf,

$$ f(x|\theta I) = f(t(x)|\theta\tau I) \left| \frac{dt}{dx} \right| = \theta\, x^{\theta - 1} \exp(-x^{\theta}) , \tag{136} $$

and cdf:

$$ F(x, \theta) = 1 - \exp(-x^{\theta}) . $$

Clearly, the cdf is decreasing in $\theta$ for $x < 1$, but increasing in $\theta$ for $x > 1$.
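The change of sign can be made explicit by differentiating the cdf above with respect to $\theta$, which gives $F_2(x, \theta) = x^{\theta} \ln x \, \exp(-x^{\theta})$. The short sketch below evaluates this derivative on both sides of $x = 1$; the function names are hypothetical and chosen only for this illustration.

```python
import math


def weibull_cdf(x, theta):
    """Cdf of the normalized Weibull variate, F(x, theta) = 1 - exp(-x**theta)."""
    return 1.0 - math.exp(-(x ** theta))


def weibull_cdf_dtheta(x, theta):
    """Partial derivative of the cdf with respect to theta."""
    return (x ** theta) * math.log(x) * math.exp(-(x ** theta))


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0):
        # Negative for x < 1, zero at x = 1, positive for x > 1
        print(x, weibull_cdf_dtheta(x, 2.0))
```

Since the sign of $F_2$ depends on $x$, neither choice of sign in the calibration condition (131) yields a non-negative function of $\theta$ for all data.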
13 Reduction to inference about location parameters under more general conditions



We can provide a consistent and calibrated parameter inference only when the problem
is reducible to estimation of a location parameter. But, as we saw in the case of the
Weibull distribution, reduction to a location parameter estimation is not always possible.
Nevertheless, under very general conditions, there is a neat way out of such difficult
situations.

When lacking any additional prior knowledge, all information about an inferred
parameter of a specified sampling distribution that can be extracted from the measured
events is contained in the value of the likelihood. After collecting $n$ independent events
$x_i$, the likelihood density for $\mathbf{x}$ reads:

$$ f(\mathbf{x} = (x_1, x_2, \ldots, x_n)|\theta I) = \prod_{i=1}^{n} f(x_i|\theta I) . $$

Let $\hat\theta$ be the value of the parameter that maximizes the value of the likelihood density,
given a particular set of collected events $\mathbf{x}$:

$$ \hat\theta = \hat\theta(x_1, x_2, \ldots, x_n) = \hat\theta(\mathbf{x}) : \quad f(\mathbf{x}|\theta I)\big|_{\theta = \hat\theta} = \max . $$

Suppose also that the integral

$$ R^2(\theta) = \int \left[ \frac{\partial}{\partial\theta} \ln f(x'|\theta I) \right]^2 f(x'|\theta I)\, dx' $$

exists for every $\theta$ in its permissible range. Then, as the number $n$ of measurements $x_i$
increases, the sampling distribution for $\hat\theta(\mathbf{x})$ converges to a Gaussian distribution with
$\mu = \theta$ and $\sigma(\theta) = (R(\theta)\sqrt{n})^{-1}$ (see, for example, [20], § 18.10, pp. 52-53 and § 18.16,
pp. 57-58). The width of the distribution decreases with increasing $n$. For a sufficiently
large number of measurements the width can be approximated by

$$ \sigma(\theta) \approx \sigma(\hat\theta) = \frac{1}{R(\hat\theta)\sqrt{n}} , $$

leaving $\theta$ as a pure location parameter of the distribution for $\hat\theta(\mathbf{x})$,

$$ f(\hat\theta|\theta I) = \frac{1}{\sqrt{2\pi}\,\sigma(\hat\theta)} \exp\left\{ -\frac{(\hat\theta - \theta)^2}{2\sigma^2(\hat\theta)} \right\} , $$

thus allowing a consistent inference about $\theta$ to be made. The required number of collected
events depends on the precision required for the inference about the parameter and on
the form of the sampling distribution (see Figure 3 with two examples for the Weibull
distribution).



Figure 3: Sampling distribution for $\hat\theta(\mathbf{x})$ resulting from maximizations of the likelihood
density $f(\mathbf{x}|\theta I) = \prod_{i=1}^{n} f(x_i|\theta I)$ that is a product of Weibull pdf's (136) (continuous
line) and the limiting Gaussian or normal distribution $N(\mu = \theta, \sigma = (R(\theta)\sqrt{n})^{-1})$
(dashed line) for $\theta = 5$ and for $n$ being a) 5 and b) 100.

14 Inferring parameters of counting experiments 

Let an urn contain balls that are identical in every respect except that some of the balls are
coloured white and the remaining ones are coloured red. Such an urn is usually referred
to as the Bernoulli urn and drawing from the urn is referred to as a Bernoulli trial. We
draw a ball from the urn blindfolded, observe and record its colour, put it back into the
urn and thoroughly shake the urn in order to minimize any possible correlations between
successive draws. Then we repeat the process until $n_0$ balls have been drawn, out of which
$n$ have been recorded to be white ($0 \le n \le n_0$). If correlations between the draws can be
neglected, the probability of recording $n$ white balls in $n_0$ draws is given by the binomial
distribution (0, § 5.2-5.7, pp. 163-168):

$$ p(n|\theta n_0 I) = \binom{n_0}{n}\, \theta^{n} (1 - \theta)^{n_0 - n} , $$

where the parameter $\theta$ of the distribution coincides with the fraction of the white balls in
the urn.

On the basis of known $n$ and $n_0$ we would like to make a probabilistic inference about
the value of the parameter $\theta$. Lacking any additional prior information, we write the pdf
for $\theta$ in accordance with (1^ :

$$ f(\theta|n n_0 I) = \frac{\pi(\theta)\, p(n|\theta n_0 I)}{\int \pi(\theta')\, p(n|\theta' n_0 I)\, d\theta'} = \frac{\pi(\theta)}{\eta(n, n_0)}\, p(n|\theta n_0 I) , $$

where the form of the consistency factor $\pi(\theta)$ is yet to be determined.
We showed that we are able to uniquely determine the form of a consistency factor
only in the presence of invariance of the sampling distribution under a continuous (Lie)
group of transformations. But in the case of counting experiments, with a discrete variate
$n$ of the sampling distribution, such an invariance is evidently absent. In this way we
lose the ground for uniquely determining the form of the consistency factor simply by
following the Cox-Polya-Jaynes Desiderata. If we insist on making a 'probabilistic' inference, we
are therefore restricted to using 'consistency factors', also referred to as non-informative
priors, whose forms are chosen on the basis of some ad hoc criteria. But no matter how
carefully these criteria and non-informative priors are specified, there is no guarantee
that in this way our reasoning remains consistent. Then, without being protected by the
Desiderata, we stand at the mercy of all kinds of paradoxes, stemming from inconsistencies
that we may unintentionally have committed.

As an example, imagine a large number of urns, each containing an unknown fraction
of white balls. We make a series of Bernoulli trials by drawing a single ball from each of
the urns, i.e. $n_0 = 1$ for each of the draws. The outcome of drawing from the $i$-th urn can
be a white ($n = 1$) or a red ($n = 0$) ball and the corresponding likelihood reads:

$$ p(n|n_0 \theta_i I) = \begin{cases} \theta_i ; & n = 1 \\ 1 - \theta_i ; & n = 0 \end{cases} $$

where $\theta_i$ is the (unknown) fraction of white balls in the $i$-th urn. Let us try to make a
'probabilistic' inference about the parameter by using a uniform non-informative prior
distribution of 'probability' for $\theta_i$:

$$ 'f(\theta_i|I)' = 1 . $$

Then, a 'pdf' can be assigned to $\theta_i$ simply by using Bayes' theorem (25):

$$ 'f(\theta_i|n n_0 I)' = \frac{'f(\theta_i|I)'\, p(n|n_0 \theta_i I)}{\int_0^1 {}'f(\theta_i'|I)'\, p(n|n_0 \theta_i' I)\, d\theta_i'} = \begin{cases} 2\theta_i ; & n = 1 \\ 2(1 - \theta_i) ; & n = 0 \end{cases} \tag{137} $$

If (137) is truly a pdf for $\theta_i$, then a 'probability'

$$ 'P(\theta_i \in (\theta_1, \theta_2)|n n_0 I)' = \int_{\theta_1}^{\theta_2} {}'f(\theta_i|n n_0 I)'\, d\theta_i $$

should cover the true value of $\theta_i$ in $100\, 'P(\theta_i \in (\theta_1, \theta_2)|n n_0 I)'$ per cent of the inferences,
but it is easy to see that such an inference is in general not calibrated. Let us choose the
shortest intervals with the 'probability' of containing $\theta_i$ being 50% (see Figure 4):

$$ (\theta_1, \theta_2) = \begin{cases} (\sqrt{2}/2,\, 1) ; & n = 1 \\ (0,\, 1 - \sqrt{2}/2) ; & n = 0 \end{cases} $$

Figure 4: 'Probability density functions' assigned to the parameter of a binomial distribution
when the number of draws is $n_0 = 1$ and the result of drawing is either a) favourable
($n = 1$) or b) non-favourable ($n = 0$). The distributions are assigned by choosing a uniform
non-informative prior. The integrals of the 'densities', 'probabilities' 'P' (hatched
regions), on the intervals a) ($\theta_1 = \sqrt{2}/2$, $\theta_2 = 1$) and b) ($\theta_1 = 0$, $\theta_2 = 1 - \sqrt{2}/2$) amount
to 0.5.

If, for example, the fraction of white balls in each of the urns were exactly one-half, our
intervals would never cover the true value, i.e. our inference is manifestly non-calibrated.
Non-informative priors of different forms, for example 



'fio\iy 



1 



or '/(0|/)' 



1 



(see ITl, § 12.4.3, p. 384, eq. (12.50), and El, § 1 1.3, p. 106, eq. (1 1.63)), are not immune 
to such kind of problems, either. 
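The lack of calibration of the uniform-prior 'pdf' (137) for a single draw can be seen in a short simulation. The sketch below uses the shortest 50% intervals given above, $(\sqrt{2}/2, 1)$ for $n = 1$ and $(0, 1 - \sqrt{2}/2)$ for $n = 0$, and compares two hypothetical populations of urns: one in which every urn has $\theta_i = 1/2$, and one in which the fractions $\theta_i$ really are uniformly distributed. The function names and the two populations are assumptions made for illustration only.

```python
import math
import random


def shortest_half_interval(n):
    """Shortest interval holding 'probability' 0.5 under 2*theta (n=1) or 2*(1-theta) (n=0)."""
    s = math.sqrt(2.0) / 2.0
    return (s, 1.0) if n == 1 else (0.0, 1.0 - s)


def urn_coverage(draw_theta, trials=20000, seed=3):
    """Fraction of single-draw inferences whose interval contains the true theta_i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        theta = draw_theta(rng)                # true fraction of white balls
        n = 1 if rng.random() < theta else 0   # one Bernoulli draw, n0 = 1
        lo, hi = shortest_half_interval(n)
        hits += lo < theta < hi
    return hits / trials


if __name__ == "__main__":
    print(urn_coverage(lambda rng: 0.5))           # every urn half white
    print(urn_coverage(lambda rng: rng.random()))  # fractions truly uniform
```

With every $\theta_i$ equal to one-half no interval ever covers the true value, whereas the stated 50% is recovered only when the fractions are in fact uniformly distributed, i.e. when the uniform distribution expresses positive knowledge rather than ignorance.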

As on two previous occasions, such an obstacle in the way of consistent parameter
inference can be overcome simply by collecting more data, such that

$$ n_0 - n \gg 1 , \qquad n \gg 1 . \tag{138} $$

When the above condition is fulfilled, the sampling distribution for $x_n$,

$$ x_n \equiv \frac{n}{n_0} , $$

converges to a Gaussian distribution $N(\mu, \sigma)$ ([12], § 4.15, pp. 138-140) with

$$ \mu(\theta) = \theta \quad \text{and} \quad \sigma(\theta) = \sqrt{\frac{\theta(1 - \theta)}{n_0}} , $$
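in the sense that

$$ \sum_{n = n_1}^{n_2} p(n|n_0 \theta I) = \sum_{n = n_1}^{n_2} p(x_n|n_0 \theta I) \longrightarrow \int_{x_{n_1}}^{x_{n_2}} f(x|\mu\sigma I)\, dx = \int_{x_{n_1}}^{x_{n_2}} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} dx . $$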

The dispersion parameter $\sigma(\theta)$ of the limiting distribution decreases with increasing
number of draws from the urn and for sufficiently large $n_0$ it can be approximated by

$$ \sigma(\theta) \approx \sigma(x_n) = \sqrt{\frac{x_n (1 - x_n)}{n_0}} . $$

In this way, $\theta$ becomes a pure location parameter of a Gaussian distribution with the
corresponding uniform consistency factor and with the corresponding pdf:

$$ f(\theta|x_n n_0 I) = \frac{\pi(\theta)}{\eta(x_n)}\, p(x_n|n_0 \theta I) = \frac{1}{\sqrt{2\pi}\,\sigma(x_n)} \exp\left\{ -\frac{(x_n - \theta)^2}{2\sigma^2(x_n)} \right\} . $$
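As a rough numerical illustration of the Gaussian limit, the sketch below draws $n_0$ balls from each of many urns and checks how often the interval $x_n \pm \sigma(x_n)$ contains the true fraction $\theta$. For a Gaussian pdf this interval carries a probability of about 0.683. The value $n_0 = 400$, the distribution of the true fractions and the function name are assumptions made for this example; for a discrete variate the agreement is only approximate.

```python
import math
import random


def gaussian_limit_coverage(n0=400, trials=3000, seed=5):
    """Coverage of the interval x_n +/- sigma(x_n) for binomial data with large n0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Arbitrary (assumed) distribution of true fractions, kept inside (0.2, 0.8)
        theta = 0.2 + 0.6 * rng.betavariate(2.0, 3.0)
        n = sum(rng.random() < theta for _ in range(n0))  # white balls in n0 draws
        x_n = n / n0
        sigma = math.sqrt(x_n * (1.0 - x_n) / n0)
        hits += x_n - sigma < theta < x_n + sigma
    return hits / trials


if __name__ == "__main__":
    print(gaussian_limit_coverage())      # close to the Gaussian one-sigma value
    print(math.erf(1.0 / math.sqrt(2.0)))  # 0.6827 for comparison
```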

At this point we find it appropriate, for the sake of completeness, to make a comment
on the density of information $n n_0 I$ that probability $p(\theta|n n_0 I)$ is conditional upon. Recall
that a continuity requirement concerning sets of different states of knowledge about
inferred propositions was listed within the common sense Desideratum II. But with $I$
merely representing information about the type of counting sampling distribution, and
with $n_0 = 1$ being a fixed number of experiments performed, the possible information
about $\theta$ after the first trial consists of two atoms, $n = 0$ and $n = 1$, that do not allow for
such a requirement to be met. In this way the proof of Cox's Theorem (see, for example,
ifTOl . or [2], § 2.1-2.2, pp. 24-45), stating that the basic Desiderata necessarily imply (up to
an isomorphic transformation) the product rule dU), the sum rule Q, and their corollary,
Bayes' Theorem (25), misses an indispensable fact. That is, in a situation like that there is
no explicit reason why an inference about a parameter of a sampling distribution should be
made in accordance with the Consistency Theorem that is deduced by assuming, among
other things, Bayes' Theorem to be the only consistent way to update the probability
distribution for the inferred parameter (recall Section 6). However, in the dense limit (138),
the continuity of information (i.e. of $n/n_0$) is recovered and the procedure of inference
about $\theta$ becomes uniquely determined by the adopted Desiderata.

Note that for a consistent and calibrated inference both conditions of (138) must be
met. Suppose for a moment that only

$$ n_0 - n \gg 1 $$

holds so that the binomial sampling distribution can be approximated by the Poisson limit
(O, §5.8-5.9, pp. 168-171):

$$ p(n|\mu I) = \frac{\mu^n}{n!}\, e^{-\mu} , $$
where the parameter $\mu$ of the distribution represents the expected number of white balls
drawn. Suppose that only red balls are drawn, i.e. $n = 0$, and that we want to make
a 'probabilistic' inference about $\mu$ by choosing a uniform non-informative prior. The
corresponding 'pdf' for $\mu$ then reads:

$$ 'f(\mu|n = 0\, I)' = \frac{'f(\mu|I)'\, p(n = 0|\mu I)}{\int_0^\infty {}'f(\mu'|I)'\, p(n = 0|\mu' I)\, d\mu'} = e^{-\mu} , $$

which implies a 'probability' for $\mu < \mu_0$,

$$ 'P(\mu < \mu_0|n I)' = \int_0^{\mu_0} {}'f(\mu|n = 0\, I)'\, d\mu = 1 - e^{-\mu_0} , $$

or, equivalently, implies a 'probability' for $\mu > \mu_0$,

$$ 'P(\mu > \mu_0|n I)' = \int_{\mu_0}^{\infty} {}'f(\mu|n = 0\, I)'\, d\mu = e^{-\mu_0} , $$

where $\mu_0$ takes an arbitrary positive value. For example, for $\mu_0 = 3$, the two 'probabilities'
equal approximately 0.95 and 0.05, respectively.

Imagine now such draws from a number of urns $N$, all ending up with $n = 0$. We can
make use of the general sum rule © and calculate a 'probability', '$P_N$', for at least one
out of $N$ parameters $\mu_i$ being greater than $\mu_0$. Since the draws from different urns are not
correlated, the 'probability' equals:

$$ 'P_N' = 1 - \left( 1 - {}'P(\mu > \mu_0|n I)' \right)^N . $$

With increasing $N$, the value of '$P_N$' approaches unity: for a sufficiently large number
of urns we can claim with 'certainty' that at least for one of the urns the parameter $\mu_i$ is
greater than $\mu_0$, regardless of the chosen value of the latter. But claiming the existence
of white balls in the urns on the basis of observing only red ones is clearly a logically
unacceptable result, pointing to serious flaws of this kind of inference. Note that it was
not the choice of the uniform non-informative prior that was decisive for the above result
since every '$f(\mu|I)$' with the existing integral

$$ \int_0^\infty {}'f(\mu'|I)'\, p(n = 0|\mu' I)\, d\mu' $$

would lead to the same kind of a logically unacceptable conclusion.
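The arithmetic behind this paradox is easy to reproduce. The sketch below evaluates the 'probability' $e^{-\mu_0}$ for a single urn and the combined 'probability' '$P_N$' for $N$ urns, using only the two formulas given above; the function names and the sample values of $N$ are hypothetical.

```python
import math


def p_mu_greater(mu0):
    """'Probability' for mu > mu0 after observing n = 0 with a uniform prior."""
    return math.exp(-mu0)


def p_at_least_one(mu0, n_urns):
    """'Probability' that at least one of n_urns parameters exceeds mu0."""
    return 1.0 - (1.0 - p_mu_greater(mu0)) ** n_urns


if __name__ == "__main__":
    print(p_mu_greater(3.0))  # about 0.05
    for n_urns in (1, 10, 100, 1000):
        # Grows towards 1 although only red balls were ever observed
        print(n_urns, p_at_least_one(3.0, n_urns))
```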
If an urn contains only red balls, the requirement

$$ n \gg 1 $$

can never be met. Is there some kind of inference that could be made in such cases? Let
us perform two Bernoulli trials by drawing $n_0$ and $2n_0$ balls from an urn, with all of the
drawn balls in both trials being red. The evidence against the presence of white balls in
the urn that was obtained by the second trial may be reasonably held to be stronger than
the evidence from the first trial. Yet how much stronger? It seems to us that in such cases
the degrees of belief, although still comparable, cannot be expressed quantitatively (see
also [4], Chapter XV, p. 137), i.e. in order to avoid all sorts of paradoxes we should remain
on a qualitative level.
15 Historical digression



In Section 6 we stressed that, contrary to the prior probability distribution in Bayes'
Theorem (25)-(26), the consistency factor in equations (63)-(66) does not represent any kind
of probability distribution. Overlooking this fact led to perpetual philosophical argument
throughout the history of probabilistic reasoning, with far-reaching practical consequences.
A natural starting point for every sequential updating of information is a state of complete
ignorance, e.g. complete prior ignorance about the value of an inferred parameter $\theta$
of a sampling distribution for $x$. The original sin is then committed in an attempt to make
use of Bayes' Theorem, (25) or (26), in such a situation. According to the Theorem, in
order to obtain a probability distribution for the inferred parameter after observing $x_1$ both
the likelihood, i.e. the probability for observing $x_1$ given a particular value of the parameter,
and a prior probability distribution for $\theta$, i.e. a distribution of our belief in different
values of the parameter prior to observing $x_1$, must be provided. With the known form of
the sampling distribution, the calculation of the likelihood is a rather straightforward task.
The problem arises when we try to formulate the distribution of our belief prior to the
first recorded event, for up until then we had been completely ignorant about the possible
values of $\theta$. What we are trying to do is to assign a prior probability distribution based
on ignorance, i.e. we are trying to establish the so-called ignorance or non-informative 
probability distributions. But according to the definition of probability adopted as a de- 
gree of reasonable belief that is based on relevant information at hand (see Sectionl^J, 
a probability assigned on grounds of ignorance is simply a contradiction in terms. A 
prior probability distribution that is based only on ignorance thus cannot be the realm of 
a consistent probability theory. We will see that apart from being self-contradicting on 
the conceptual level, the concept of ignorance or non-informative priors inevitably also 
produces many practical inconsistencies and paradoxes. 

This delusion about probability distributions based on ignorance has been present for 
more than two centuries. In a scholium to his essay fTSl Bayes suggested that in the 
absence of all prior knowledge it is reasonable to assume a uniform distribution for p, 
where p stands for, for example, an unknown fraction of white balls in an urn experiment 
as described in Section[T3| Laplace ([|51, p. XVII) was also very explicit on the same 
subject: "When the probability of a single event is unknown we may suppose it equal 
to any value from zero to unity." The above assumption is usually referred to as the 
Bayes principle ([12], § 8.19, p. 298), the Laplace principle of insufficient reason BIJ or
the principle of indifference ([2], § 2.4, p. 40). When applied to estimation of a general
unknown parameter, the assumption would read: in the limit of complete prior ignorance 
about the value of an inferred parameter, the prior distribution should be uniform. 

We strongly disagree with such a rule and propose a very simple rule to replace it:
when there is no (prior) information, no (prior) probabilities are to be assigned whatsoever.
Or to paraphrase de Finetti: prior probability does not exist. Knowing that a uniform
prior probability distribution in the range $(\theta_a, \theta_b)$ has been assigned to the value of a
parameter $\theta$ as a result of positive knowledge, and not knowing anything about $\theta$ with the
exception of its admissible range, are two fundamentally different states of knowledge.

In order to illustrate the difference, we return to drawing coloured balls from urns.
Imagine having a good number of urns, each containing red and white balls, with the
fractions of white balls, $\theta_i$, in each of the urns being unknown. According to the first
scenario, we possess information that the fractions are distributed uniformly in the range
$(0, 1)$, while according to the second scenario we know nothing about the distribution of
the values of $\theta_i$. It is easy to see that the two states of knowledge are completely different.

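In the first case it is possible to assign a probability for $\theta_i$ of the $i$-th urn to be in an
arbitrary interval $(\theta_1, \theta_2) \subset [0, 1]$ even before drawing the first ball from the urn simply
by integrating the (uniform) prior probability distribution:

$$ P(\theta_i \in (\theta_1, \theta_2)|I) = \int_{\theta_1}^{\theta_2} f(\theta|I)\, d\theta = \int_{\theta_1}^{\theta_2} d\theta = \theta_2 - \theta_1 . \tag{139} $$

On making the first draw from the $i$-th urn, the ball can be either white ($n_i = 1$) or red
($n_i = 0$), implying the corresponding likelihood to be:

$$ p(n_i|\theta_i I) = \begin{cases} \theta_i ; & n_i = 1 \\ 1 - \theta_i ; & n_i = 0 \end{cases} $$

The updated posterior probability therefore reads:

$$ P(\theta_i \in (\theta_1, \theta_2)|n_i I) = \begin{cases} \theta_2^2 - \theta_1^2 ; & n_i = 1 \\ 2(\theta_2 - \theta_1) - (\theta_2^2 - \theta_1^2) ; & n_i = 0 \end{cases} \tag{140} $$

where the update was made by using Bayes' Theorem (25). Note that both probabilities
are calibrated: when repeating the above assessments for all of the urns, intervals $(\theta_1, \theta_2)$
contain the true fractions of white balls in each of the urns in $100\, P(\theta_i \in (\theta_1, \theta_2)|I)$
percent of cases before any of the balls were drawn, while after drawing the coverage is
exactly $P(\theta_i \in (\theta_1, \theta_2)|n_i I)$.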

Within the second scenario, with complete prior ignorance about the distribution of $\theta_i$
within the urns, we cannot make a probabilistic inference about the fraction of white balls in
a particular urn before any of the balls is drawn. The statement that the probability for
$\theta_i$ to be in the interval $(\theta_1, \theta_2)$ equals the length of the interval (see eq. (139)) would in
general be uncalibrated unless the distribution of $\theta_i$ within the urns is truly uniform, but
this we do not know and it need not be true. The same holds for the probability statement
(140) after drawing one ball from each of the urns. We saw in the previous section that
in the limit of complete prior ignorance, consistent and calibrated probability statements
about the parameters $\theta$ can be made only after drawing enough balls from each of the urns
so that the Gaussian limit of the binomial sampling distribution can be applied.

Many failed to recognize the fundamental difference between knowing the prior probability
distribution of an inferred parameter to be uniform and not knowing anything about
the value of the parameter. Harold Jeffreys, for example, wrote ([15], § 1.22, p. 29): "If
there is originally no ground to believe one of a set of alternatives rather than another,
their prior probabilities are equal." The same standpoint was persistently advocated also
by Edwin Jaynes ([2], § 18.11.1, p. 573): "Before we can use the principle of indifference
to assign numerical values of probabilities, there are two different conditions that must
be satisfied: (1) we must be able to analyze the situation into mutually exclusive, exhaustive
possibilities; (2) having done this, we must then find available information gives us
no reason to prefer any of the possibilities to any other. In practice, these conditions are
hardly ever met unless there is some evident element of symmetry in the problem. But
there are two entirely different ways in which condition (2) might be satisfied. It might be
satisfied as a result of ignorance, or it might be satisfied as a result of positive knowledge
about the situation."^

We could not agree more with Jaynes if it were not for the last sentence. Take, for
example, two different samples of radionuclides. For the first sample we do not know
anything about the isotopes except that the permissible range of their expected decay
times is in an interval $(\tau_a, \tau_b)$, while for the second sample the isotopes were chosen in
such a way that the distribution of their expected decay times can be well approximated
by a uniform distribution in the same interval. As a result of ignorance, we cannot make
inferences about the expected decay times $\tau_i$ of the isotopes in the first sample before
measuring their actual decay times $t_i$. After the decay time $t_i$ of the $i$-th isotope is measured,
however, we can make a probabilistic statement about $\tau_i$ according to (64):

$$ f(\tau_i|t_i I) = \frac{\pi(\tau_i)}{\eta(t_i)}\, f(t_i|\tau_i I) , $$

where $I$ stands for prior ignorance about the expected decay time of a particular radionuclide,
while the consistency factor $\pi(\tau_i)$ was chosen according to (110). From SectionfTTl
we recall that our inference is consistent if the integral of the above pdf outside the
admissible range for $\tau_i$ is small compared to the precision required. For the second sample
of isotopes, as a result of positive knowledge, there is a pdf for the expected decay time
of each of the isotopes at our disposal even prior to measuring the actual decay times:

$$ f(\tau_i|I') = \begin{cases} \dfrac{1}{\tau_b - \tau_a} ; & \tau_a < \tau_i < \tau_b \\ 0 ; & \text{otherwise} , \end{cases} $$

where $I'$ stands for positive prior knowledge. After $t_i$ is measured (and not knowing
the results of other possible measurements of decay times from the same sample), the
distribution of our belief can be updated by means of Bayes' theorem (29):

$$ f(\tau_i|t_i I') = \frac{f(\tau_i|I')\, f(t_i|\tau_i I')}{\int_{\tau_a}^{\tau_b} f(\tau'|I')\, f(t_i|\tau' I')\, d\tau'} . $$

Note that when appropriate prior pdf's exist, the inferences are always consistent and
calibrated. Figure 5 shows examples of inferences with and without existing prior probability
distributions.



A very similar idea expressed in very similar words by Anthony O'Hagan can be found in [16], § 4.39, p. 112.
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Figure 5: Probability density functions for the expected decay time Tj of the z-th isotope 
from a sample of isotopes with the expected decay times distributed approximately uni- 
formly in an admissible range (tq, ti,) = (0, 10) (hatched areas), and the corresponding 
pdf for an isotope from a sample with the expected decay times in the same interval but 
with an unknown distribution of their actual values (shaded area). Figure a) displays the 
pdf before any of the actual decay times are measured, while Figure b) shows appropri- 
ate pdf's based on a measured value tj = 1 in both cases. All decay times are given in 
arbitrary but equal units. While the hatched histograms are limited within the permissible 
range, the integral of the shaded pdf in the range (10, oo) is approximately £2 = 0.095. 
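
The value 0.095 quoted in the caption can be checked with a few lines of Python. The sketch below is only an illustration, assuming the pdf $(t_i/\tau_i^2)\, e^{-t_i/\tau_i}$ for the case without a prior distribution and a strictly uniform prior on $(\tau_a, \tau_b)$ for the other case; the function names and the midpoint-rule normalisation are choices made here, not part of the original text.

```python
import math

def consistent_pdf(tau, t):
    """pdf for tau given one decay time t: (t / tau**2) * exp(-t / tau)."""
    return t / tau**2 * math.exp(-t / tau)

def consistent_cdf(tau, t):
    """Integral of consistent_pdf from 0 to tau, which equals exp(-t / tau)."""
    return math.exp(-t / tau)

def tail_outside_range(t, tau_b):
    """Probability assigned to tau > tau_b, i.e. outside the admissible range."""
    return 1.0 - consistent_cdf(tau_b, t)

def bayes_pdf(tau, t, tau_a, tau_b, n=20000):
    """Posterior from a uniform prior on (tau_a, tau_b), normalised
    numerically with the midpoint rule (an assumption of this sketch)."""
    if not (tau_a < tau < tau_b):
        return 0.0
    like = lambda x: math.exp(-t / x) / x      # exponential decay-time density
    h = (tau_b - tau_a) / n
    norm = sum(like(tau_a + (k + 0.5) * h) for k in range(n)) * h
    return like(tau) / norm

tail = tail_outside_range(t=1.0, tau_b=10.0)
print(round(tail, 3))  # 0.095
```

The posterior based on the uniform prior vanishes outside $(\tau_a, \tau_b)$ by construction, whereas the pdf obtained without a prior distribution leaves the fraction $1 - e^{-t_i/\tau_b}$ of the probability above $\tau_b$.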



Apart from the problems with calibration, the Bayes postulate also contains some
insurmountable inconsistencies. Suppose we are estimating a parameter $\theta$ with no prior
information available. According to the postulate, the prior 'probability' distribution in
such a case is given by a uniform 'pdf':

$$\text{'}f(\theta|I)\text{'} = 1 .$$

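But, instead of $\theta$, we could have equally well chosen a different parameterization, say $\lambda(\theta)$,
where

$$\theta \longrightarrow \lambda \tag{141}$$

is a bijective parameter transformation. Then, due to the consistency Desideratum III.c,
the appropriate prior 'pdf' for the transformed variate $\lambda$, obtained according to (16), reads:

$$\text{'}f(\lambda|I)\text{'} = \text{'}f(\theta|I)\text{'} \left|\frac{d\theta}{d\lambda}\right| = \left|\frac{d\theta}{d\lambda}\right| = \left|\frac{d\lambda}{d\theta}\right|^{-1} . \tag{142}$$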



In the case of a non-linear transformation (141), the absolute value of the derivative in
(142) is not a constant, i.e. the prior 'pdf' for $\lambda$ is not uniform. But since a one-to-one
mathematical transformation like (141) does not change the state of knowledge about the
inferred parameters, we also remain completely ignorant about $\lambda$. The Bayes postulate
would therefore imply a uniform prior distribution for $\lambda$, which obviously contradicts
(142). That is, the Bayes postulate in general contradicts the consistency Desideratum III.c.

A sophisticated version of the principle of insufficient reason is referred to as the
principle of maximum entropy. The term information entropy was introduced by Claude
Shannon ([29], § 6) as

$$S = -\sum_i p_i \ln p_i ,$$

where $p_i$ is the probability assigned to the inferred parameter of taking a value within an
interval $(\theta_i, \theta_i + d\theta_i)$, and the sum covers the whole admissible range of $\theta$. In order for it
to be used for determination of non-informative prior distributions, $p_i$ is interpreted as

$$p_i = \text{'}p(\theta_i|I)\text{'} = \text{'}f(\theta_i|I)\text{'}\, d\theta_i ,$$

where '$f(\theta_i|I)$' stands for the non-informative prior 'pdf' for $\theta$. The principle of maximum
entropy then states ([2], § 11.3, p. 350) that the function '$p(\theta_i|I)$' which maximizes
entropy represents the most honest description of what we know about the value of the
inferred parameter.

Suppose, for example, that we know only that $\theta \in (\theta_a, \theta_b)$ is a parameter of a specified
sampling distribution, and that the prior 'probability' distribution for $\theta$ is subject to the
usual constraint

$$\sum_i \text{'}p(\theta_i|I)\text{'} = \sum_i p_i = 1 . \tag{143}$$

Then, the principle of maximum entropy and the above constraint are taken simultaneously
into account if $p_j$ maximizes the function $H$,

$$H = S - \alpha \Big( \sum_i p_i - 1 \Big) = -\sum_i p_i \ln p_i - \alpha \Big( \sum_i p_i - 1 \Big) ,$$

that is, if

$$\frac{\partial H}{\partial p_j} = -(\ln p_j + 1 + \alpha) = 0 ,$$

where $\alpha$ is a Lagrange multiplier. It is therefore the constant non-informative prior
'probability' distribution

$$p_j = \text{'}f(\theta_j|I)\text{'}\, d\theta_j = e^{-(1+\alpha)} = \text{constant}$$

that maximizes the information entropy, subject to the usual normalization condition
(143). If we choose intervals of uniform width

$$d\theta_i = \text{constant} ,$$

the principle of maximum entropy yields a uniform non-informative prior 'pdf'. But a
non-linear one-to-one re-parameterization

$$\theta \longrightarrow \lambda(\theta)$$

implies widths of intervals

$$d\lambda_i = \left|\frac{d\lambda}{d\theta}\right|_{\theta_i} d\theta_i$$

and the corresponding 'pdf'

$$\text{'}f(\lambda_i|I)\text{'} = \text{'}f(\theta_i|I)\text{'} \left|\frac{d\lambda}{d\theta}\right|^{-1}_{\theta_i} ,$$

neither of the two being uniform. And so an immediate question can be raised: which
(and why) is the distinguished parameterization of the sampling distribution with both the
interval widths and the non-informative prior 'pdf' being uniform? In other words, this
means that the principle of maximum entropy cannot solve the ancient ambiguity of how
to find the elusive non-informative 'probability' distributions.
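
The argument can be made concrete with a small numerical sketch in Python. It is an illustration only: the five bins, the interval $(1, 2)$ and the map $\lambda = \theta^2$ are hypothetical choices made here, not taken from the text. The sketch checks that equal bin probabilities have larger entropy than another normalised assignment, and that the same probabilities give a uniform density in $\theta$ but a non-uniform one in $\lambda$.

```python
import math

def entropy(p):
    """Information entropy S = -sum_i p_i ln p_i (terms with p_i = 0 contribute 0)."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def uniform_bins(theta_a, theta_b, n):
    """Edges of n bins of equal width in theta."""
    h = (theta_b - theta_a) / n
    return [theta_a + k * h for k in range(n + 1)]

def density(p, edges):
    """Probability per unit length in each bin."""
    return [q / (hi - lo) for q, lo, hi in zip(p, edges[:-1], edges[1:])]

n = 5
p_uniform = [1.0 / n] * n                  # maximum-entropy assignment
p_other = [0.1, 0.2, 0.4, 0.2, 0.1]        # some other normalised assignment

theta_edges = uniform_bins(1.0, 2.0, n)
lam_edges = [th**2 for th in theta_edges]  # hypothetical map lambda = theta**2

pdf_theta = density(p_uniform, theta_edges)  # constant across bins
pdf_lam = density(p_uniform, lam_edges)      # decreases from bin to bin

print(entropy(p_uniform) > entropy(p_other))  # True
```

Maximizing the entropy fixes only the bin probabilities; whether the resulting 'pdf' is uniform depends entirely on the parameterization in which the bins were chosen to have equal width.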

Jeffreys ( ifTSl . § 3.1) proposed the following solution to the problem of non-informative 
priors. He suggested that for parameters with the admissible range coinciding with the 
whole real axis we should keep to Bayes postulate, i.e. to the uniform priors, while for 
parameters known to be positive the proper way to express complete ignorance is to assign
uniform prior probability to its logarithm, i.e. in the latter case the prior 'pdf' should
be:

$$\text{'}f(\theta|I)\text{'} \propto \frac{1}{\theta} ; \quad \theta \in (0, \infty) . \tag{144}$$

He tried to justify this on the grounds of invariance of the former prior under translation,

$$\lambda = \theta + b ; \quad \theta \in (-\infty, \infty) ,$$

and on the grounds of invariance of the latter prior under raising $\theta$ to a power $n$ or
scaling it by a positive constant $a$:

$$\lambda = \theta^n \quad \text{and} \quad \lambda = a\theta .$$

Why invariance of the non-informative priors under these particular transformations?
For the location, scale and dispersion parameters the answer was extensively searched for
by utilizing related transformation groups (see refs. [30], [31], [18] and Chapter 12 in
[2]).

For example, when we are completely ignorant about the location $\mu$ of a distribution for
$x$ prior to starting to collect data, a mere shift of location,

$$\mu \longrightarrow \mu' = \mu + b , \tag{145}$$

could not change our state of knowledge. That is, according to the principle of group
invariance, the non-informative prior 'probability' distribution for $\mu$ should be invariant
under the group of transformations (145). It is easy to see that the uniform prior is the only
one that satisfies this condition.

Similarly, when we are completely ignorant about the scale $\tau$ of a distribution for $t$,
the prior 'probability' distribution for $\tau$ should be invariant under a scale transformation,

$$\tau \longrightarrow \tau' = a\tau . \tag{146}$$

The only non-informative prior 'probability' distribution satisfying the above condition
is expressed by Jeffreys' prior (144). To see that, we should realize that the required
invariance implies equality of the prior 'probabilities'

$$\text{'}p(\tau|I)\text{'} = \text{'}f(\tau|I)\text{'}\, d\tau = \phi(\tau)\, d\tau$$

and

$$\text{'}p(\tau'|I)\text{'} = \text{'}f(\tau'|I)\text{'}\, d\tau' = \phi(\tau')\, d\tau' = \phi(a\tau)\, a\, d\tau ,$$

or, equivalently,

$$\phi(\tau) = \phi(a\tau)\, a .$$

If this is to be true for any $a \in (0, \infty)$, it must also be true for $a = \tau^{-1}$. Hence:

$$\text{'}f(\tau|I)\text{'} = \phi(\tau) = \phi(1)\, \frac{1}{\tau} ,$$

and the above statement about Jeffreys' prior is proved.
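
The functional equation $\phi(\tau) = \phi(a\tau)\, a$ is easy to test numerically. The following Python sketch is a check on a finite grid only, not a proof; the helper name `is_scale_invariant` and the grid values are assumptions made here. It confirms that $\phi(\tau) = 1/\tau$ satisfies the equation while a uniform $\phi$ does not.

```python
import math

def jeffreys(tau):
    """Unnormalised scale prior phi(tau) = 1 / tau, eq. (144)."""
    return 1.0 / tau

def uniform(tau):
    """Unnormalised uniform prior, phi(tau) = 1."""
    return 1.0

def is_scale_invariant(phi, taus, scales, rel_tol=1e-12):
    """Check phi(tau) == a * phi(a * tau) for every tau and a on the grids."""
    return all(
        math.isclose(phi(tau), a * phi(a * tau), rel_tol=rel_tol)
        for tau in taus
        for a in scales
    )

taus = [0.1, 0.5, 1.0, 3.0, 20.0]
scales = [0.25, 2.0, 7.5]

print(is_scale_invariant(jeffreys, taus, scales))  # True
print(is_scale_invariant(uniform, taus, scales))   # False
```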

In the same spirit, a non-informative prior 'probability' distribution representing complete
ignorance about both location $\mu$ and scale $\sigma$ should be invariant under simultaneous
location and scale transformations:

$$\mu \longrightarrow \mu' = a\mu + b , \qquad \sigma \longrightarrow \sigma' = a\sigma . \tag{147}$$

Following the idea of the proof from the preceding example leads to a non-informative
'pdf' of the form:

$$\text{'}f(\mu\sigma|I)\text{'} \propto \frac{1}{\sigma^2} . \tag{148}$$

However, strong objections can be raised against the form of (148). For example, the
non-informative prior (148) leads to a marginalization paradox (see, for example, references
lB3 and O^, and Chapter 10, pp. 81-94 of reference [18]). Let $\mu$ be a location parameter
with known fixed value, i.e. we are inferring only a dispersion parameter $\sigma$. By inserting
the non-informative Jeffreys' prior (144) in the Bayes' Theorem (26) we obtain:

$$\text{'}f(\sigma|\mu x s I)\text{'} = \frac{\text{'}f(\sigma|I)\text{'}\, f(x s|\mu\sigma I)}{\int_0^\infty \text{'}f(\sigma'|I)\text{'}\, f(x s|\mu\sigma' I)\, d\sigma'} \propto \frac{1}{\sigma}\, f(x s|\mu\sigma I) . \tag{149}$$

Note that the result is identical to the one obtained by inserting the appropriate consistency
factor (110) into the Consistency Theorem (64). But we can also obtain a 'pdf' for $\sigma$
given $\mu$, $x$ and $s$ in another way. We start with a simultaneous inference about unknown
parameters $\mu$ and $\sigma$, based on the prior (148):

$$\text{'}f(\mu\sigma|x s I)\text{'} \propto \text{'}f(\mu\sigma|I)\text{'}\, f(x s|\mu\sigma I) \propto \frac{1}{\sigma^2}\, f(x s|\mu\sigma I) .$$

Then we make use of the product rule and rewrite the above expression as:

$$\text{'}f(\mu\sigma|x s I)\text{'} = \text{'}f(\sigma|\mu x s I)\text{'}\; \text{'}f(\mu|x s I)\text{'} ,$$

where '$f(\mu|x s I)$' is the marginalized 'pdf' (see eq. (23)):

$$\text{'}f(\mu|x s I)\text{'} = \int_0^\infty \text{'}f(\mu\sigma'|x s I)\text{'}\, d\sigma' .$$

By combining the last three equations we can therefore write down the 'pdf' for $\sigma$ given
$\mu$ and observed $x$ and $s$ also as:

$$\text{'}f(\sigma|\mu x s I)\text{'} = \frac{\text{'}f(\mu\sigma|x s I)\text{'}}{\text{'}f(\mu|x s I)\text{'}} \propto \frac{1}{\sigma^2}\, f(x s|\mu\sigma I) . \tag{150}$$

The pdf's (149) and (150) are in an obvious contradiction, i.e. the non-informative priors
(144) and (148) are inconsistent. For further discussion about the marginalization paradox
see Appendix B.

When recognizing this fundamental difficulty, some authors claim the procedure of
marginalization (23) to be illegitimate (see, for example, ref. [18], § 10, pp. 81-94 and
§ 17.2, pp. 163-164), despite the fact that - according to Cox's Theorem - the procedure
is implied by the basic Desiderata. Others apply transformations different from (147) in
attempts to solve the problem (see, for example, [2], § 12.4, p. 378, equation (12.18)).
But for the latter, since the modified transformations do not correspond to a simultaneous
translation and scale transformation, the original motivation to relate the complete prior
ignorance about location, scale and dispersion parameters to invariance of the corresponding
non-informative priors with respect to the transformations (145)-(147) is definitely lost.

Note that the formalism related to the principle of group invariance of non-informative 
priors is remarkably similar to the formalism applied for determination of consistency 
factors. But no matter how strong this similarity may appear at first glance, there is a fun- 
damental difference between the two methods, leading to substantially different results. 
Group invariance of non-informative prior probabilities is imposed as a principle, addi- 
tional to the basic Desiderata, while we made use of invariance of likelihoods (being well 
defined probabilities) as a necessary condition for equivalence of two states of knowl- 
edge that led to functional equation (iTSt for consistency factors. In the latter case it is 
the ratio of $\pi(\theta)\, d\theta$ and $\eta(x)\, dx$ that is to be invariant under simultaneous transformations
$\bar{g}_a(\theta) \in \bar{\mathcal{G}}$ and $g_a(x) \in \mathcal{G}$ of the inferred parameter and the sampling variate $x$, while
$\pi(\theta)\, d\theta$ itself need not be invariant under $\bar{\mathcal{G}}$ since it is, by definition, uniquely determined
only up to a multiplicative constant; and it is due to this degree of freedom that marginalization
paradoxes, stemming from strict applications of the principle of group invariance,
are avoided.

In practice, the difference between the probability distribution assigned simultaneously
to a location and a dispersion parameter by multiplying the appropriate likelihood
by the corresponding consistency factor (114), and the 'probability' distribution assigned
by utilizing the non-informative prior (148), will fade away with increasing number of
collected events. So, from a pragmatic standpoint, arguments about which function correctly
expresses a state of complete prior ignorance might amount to quibbling over pretty
small peanuts ([2], § 6.15, p. 183). But, from a standpoint of principle, this is definitely
not true, for the convergence of the limiting distributions by itself certainly does not guarantee
either of the two to be correct, i.e. we might have been completely wrong in both
cases. Fundamental difficulties with non-informative priors had very serious consequences
for inductive reasoning. For example, applications of the Laplace principle of
insufficient reason that led to logically unacceptable results provided Polya ([4], Chapter XV,
§ 6, pp. 133-136) with the main reason to remain at a qualitative level when linking
induction with probability. Others (see, for example, references [34] and [35]) refused
even a qualitative correspondence. As noted already by Jeffreys (" ifTSl . § 3.1, p. 120), "a 
succession of authors have said that the prior probability is nonsense and therefore that 
the principle of inverse probability, which cannot work without it, is nonsense too." 

In this way two, at first glance fundamentally distinct, schools of inductive reason- 
ing emerged. The first one, usually referred to as the Bayesian school due to the central 
role of the Bayes' theorem in the process of inference, recognizes probability as a degree 
of reasonable belief and applies probability theory in the course of inductive reasoning. 
The second one, usually referred to as the frequentist school due to its strict frequency 
interpretation of probability, advocates the usage of the calculus of probability only for 
treatment of so-called random phenomena. The aim of the frequentist school is to avoid 
the supposed mistakes and inconsistencies of the probabilistic inductive inference, so they 
relegate the problems of inductive inference, e.g. estimations of distribution parameters, 
to a new field, statistical inference. The lack of applications of probability theory may,
however, represent a serious drawback when making inductive inferences. For example, the
Fisher-Behrens problem, introduced in Section 10, may become an insurmountable obstacle
outside the probabilistic parameter inference (see, for example, [20], § 19.47,
Example 19.10, pp. 160-162 and § 26.28-26.29, pp. 441-442). In particular, difficulties stem
from the illegitimacy of the marginalization procedure within the theory of statistical
inference.

Some of the substitutes for the calculus of probability that are proposed within the
framework of statistical inference are put forward as solutions of specific problems, such
as the principles of least squares and minimum chi squared. The principle of Maximum
Likelihood [34], however, is usually advocated as one of general application (see [12],
§ 8.22-8.27, pp. 300-304, and [20], Chapter 18, pp. 46-104). The principle states that,
when confronted with a choice of the values of a parameter $\theta \in (\theta_a, \theta_b)$, we choose
the particular value $\hat{\theta}$ which maximizes the corresponding likelihood, $p(x|\theta I)$, for the
observed data $x$. In general, the principle contradicts our basic Desiderata. Imagine a
problem of inference when there is, apart from $x$, some additional information $I$ at hand
that would allow for an assignment of a probability distribution for $\theta$ on its own. Neglecting
this information directly contradicts Desideratum III.b. For special cases, when the pdf
for $\theta$ based on prior information $I$, $f(\theta|I)$, is uniform in a wide interval around $\hat{\theta}$, the
contradiction is removed and the principle can no longer be disputed.
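
A small numerical sketch in Python may help to see when the Maximum Likelihood value and the most probable value coincide. Everything specific in it is an assumption made for illustration: the exponential decay-time likelihood, the observed value $t = 2$, the Gaussian-shaped informative prior and the grid search are not taken from the text. With a prior that is flat over the whole grid the two maxima coincide, while a peaked prior moves the maximum away from the Maximum Likelihood value.

```python
import math

def likelihood(t, tau):
    """Exponential decay-time density (1/tau) * exp(-t/tau), as a function of tau."""
    return math.exp(-t / tau) / tau

def flat_prior(tau):
    """Prior pdf that is uniform over the whole grid (unnormalised)."""
    return 1.0

def peaked_prior(tau, centre=5.0, width=1.0):
    """Hypothetical informative prior: unnormalised Gaussian bump."""
    return math.exp(-0.5 * ((tau - centre) / width) ** 2)

def mode_on_grid(func, lo, hi, n=20000):
    """Location of the maximum of func on an evenly spaced grid."""
    step = (hi - lo) / n
    grid = [lo + k * step for k in range(n + 1)]
    return max(grid, key=func)

t_obs = 2.0
ml = mode_on_grid(lambda tau: likelihood(t_obs, tau), 0.1, 20.0)
flat = mode_on_grid(lambda tau: flat_prior(tau) * likelihood(t_obs, tau), 0.1, 20.0)
peaked = mode_on_grid(lambda tau: peaked_prior(tau) * likelihood(t_obs, tau), 0.1, 20.0)

print(flat == ml)  # True
```

Here `ml` lies within one grid step of $t = 2$, whereas `peaked` is pulled towards the centre of the informative prior, which is the information that the Maximum Likelihood value ignores.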

When there is no prior knowledge about the value of a particular inferred parameter,
according to the consistency Desiderata, the inference should be made on the basis of both
the likelihood containing the information about $\theta$ from the measurement $x$, and the consistency
factor containing information about $\theta$ coming from the known form of the sampling
distribution for $x$. Ignoring the latter in general implies our reasoning to be inconsistent,
i.e. to be in a direct contradiction with Desideratum III.c. Imagine, for example, two persons
inferring average decay times $\tau_i$ of unknown particles on the basis of single decay
time measurements $t_i$. Mr. A proceeds according to the Maximum Likelihood principle
by extracting values

$$\hat{\tau}_i^A = t_i$$

of the parameter that maximize the likelihoods

$$p(t_i|\tau_i I) = f(t_i|\tau_i I)\, dt = \frac{1}{\tau_i}\, e^{-t_i/\tau_i}\, dt$$

for observing measured decay times in an interval $(t_i, t_i + dt)$, given particular $\tau_i$'s. In
accordance with the adhered principle, $\hat{\tau}_i^A$ should be the value of the parameter being
indicated by the datum $t_i$ as the strongest candidate, i.e. in the long term the fraction of
Mr. A's confidence intervals, $(\hat{\tau}_i^A, \hat{\tau}_i^A + d\tau)$, covering the true values of inferred parameters,
should be larger than the corresponding fractions of any other interval of the same
width, $d\tau$. Mr. B, on the other hand, chooses the consistent procedure: he extracts the
values

$$\hat{\tau}_i^B = \frac{t_i}{2}$$

of the parameters that maximize the pdf's for $\tau_i$ (see equations (64) and (110)),

$$f(\tau_i|t_i I) = \frac{t_i}{\tau_i}\, f(t_i|\tau_i I) = \frac{t_i}{\tau_i^2}\, e^{-t_i/\tau_i} ,$$

taking this way into account both information from the immediate data $t_i$, and information
$I$ about the form of the sampling distribution that is contained in the consistency factor
$\pi(\tau_i)$. It is easy to see that, contrary to the claims of Mr. A, the coverage of Mr. B's
confidence intervals for $\tau_i$, $(\hat{\tau}_i^B, \hat{\tau}_i^B + d\tau)$, surpasses that of Mr. A's intervals, $(\hat{\tau}_i^A, \hat{\tau}_i^A + d\tau)$,
by a factor of $4/e \approx 1.47$.
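
The factor $4/e$ is the ratio of the pdf values at $\tau_i = t_i/2$ and $\tau_i = t_i$, namely $(4e^{-2}/t_i)/(e^{-1}/t_i)$. The Python sketch below checks it as a long-run frequency. It rests on an assumption that should be stated loudly: for each measurement both persons use the same interval width, taken here as a fixed fraction `eps` of the measured $t_i$; the range of true decay times in the simulation is also a hypothetical choice. Under that assumption the coverage of an interval $(a t_i, b t_i)$ is $e^{-1/b} - e^{-1/a}$ for any true $\tau_i$.

```python
import math
import random

def coverage_exact(a, b):
    """Long-run fraction of intervals (a*t, b*t) containing the true tau when
    t is exponentially distributed with mean tau: P(tau/b < t < tau/a)."""
    return math.exp(-1.0 / b) - math.exp(-1.0 / a)

def coverage_simulated(a, b, n_trials, rng):
    """Monte Carlo estimate of the same fraction, with a new true tau per trial."""
    hits = 0
    for _ in range(n_trials):
        tau = rng.uniform(0.5, 10.0)      # hypothetical spread of true decay times
        t = rng.expovariate(1.0 / tau)    # one measured decay time with mean tau
        if a * t < tau < b * t:
            hits += 1
    return hits / n_trials

eps = 0.01                                # interval width d_tau = eps * t
cov_a = coverage_exact(1.0, 1.0 + eps)    # Mr. A: interval starts at t
cov_b = coverage_exact(0.5, 0.5 + eps)    # Mr. B: interval starts at t / 2
ratio = cov_b / cov_a                     # tends to 4/e as eps -> 0

rng = random.Random(12345)
sim_ratio = (coverage_simulated(0.5, 0.55, 200_000, rng)
             / coverage_simulated(1.0, 1.05, 200_000, rng))
```

For `eps = 0.01` the exact ratio is within one percent of $4/e$; the simulated ratio, which uses wider intervals to collect enough hits, scatters around the corresponding exact value of about 1.5.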

Again, for special cases when inferring explicit location parameters, the consistency 
of the principle of Maximum Likelihood is restored. 

Last but not least, in order to avoid unnecessary though frequent misunderstandings,
we would like to clarify the following. Most of the advocates of the so-called subjective
Bayesian school of thought find both the existence of complete prior ignorance about
the inferred parameter ([16], § 4.15, p. 102), and the existence of an exact equivalence
of information possessed by two different persons inferring the same parameter ([16],
§ 1.16, p. 11), impossible and thus irrelevant to a theory of inductive reasoning. We believe
that such statements are poorly grounded: it would be just as absurd, for example,
to reject the use of right-angled and similar triangles when constructing images within
geometrical optics just because no real triangle is exactly right-angled and
no two real triangles are exactly similar. In Sectional we demonstrated that a particular 
state of knowledge is exactly the same as the state after an arbitrary one-to-one variate
transformation, while in Section 7 we stressed that ignorance is just a limiting state of
knowledge and the natural starting point in any actual inference, just as zero is the natural
starting point in adding a column of numbers.

Similarly, it might also be argued that experiments like drawing balls from
urns, tossing dice, or collecting the decay times of unstable particles are oversimplified
and therefore not adequate for calibration of methods derived for real scientific
inferences about unknown parameters. But such simple (not oversimplified) experiments
usually serve as a paradigm for all experiments with well-known sampling distributions of
the collected data, and with well-controlled experimental conditions, both features being
among the basic assumptions of our theory (see Sections 3 and 7).

16 On the probability of general hypotheses 

Theories are nets: only he who casts will catch. 

Friedrich von Hardenberg (Novalis)

Inference about a parameter, say $\theta$, of a distribution for a certain sampling variate,
say $x$, is, by definition, always conditional on a specified model (recall Section 3). The
value of a parameter is always estimated under the assumption that within the specified
family of distributions, with each member of the family being completely determined
by the value of the inferred parameter, there is a distribution, specified by the so-called
true value of $\theta$, that corresponds to the actual sampling distribution of $x$. How well can
such an assumption be justified: what is the probability for the specified model to be
true? Or, putting it in a wider framework, what is the probability for a general hypothesis
$A$ to be true? For example: what is the probability $p(A_2|I)$ of Newton's law of universal
gravitation [36], here denoted by $A_2$, judged in the light of the facts $I$ collected in the first
edition of the Principia?

In order to answer such a question, we must be able to analyze the situation into
mutually exclusive, exhaustive possibilities $A_1, A_2, \ldots$ in order to allow for the usual
normalization

$$\sum_i p(A_i|I) = 1 .$$

Recall that the unity in the above expression is only a matter of convention within
Desideratum I, but a normalization of the probability for an exhaustive set to an arbitrary (positive)
constant value is necessary in order to define the scale of the assigned probabilities. For
example, without such a normalization it would be impossible to say whether a
certain probability, say $p(A_2|I) = 0.75$, is high or low.

Unfortunately, the normalization represents a task that cannot be consistently accomplished.
First of all, the absolute status of a hypothesis embedded in the universe of all
conceivable theories cannot be stated. That is, its probability within the class of all
conceivable theories is neither large nor small; it is simply undefined because the class of all
conceivable theories is undefined ([2], § 9.16.1, p. 310).

Consequently, it would only be possible to express the plausibility of a hypothesis
within a class of well-defined alternatives. But what is the criterion that a hypothesis
should satisfy in order to take a place within such a class? Let us, for example, choose a
class of alternatives $A_n$ to Newton's law of universal gravitation such that the gravitational
attraction between two planets could be inversely proportional to the $n$-th power of the
distance $r$ between the planets, where $n$ is limited to positive integers. Since we are
not able to justify the restriction of $n$ to positive integers only - these have been chosen
completely arbitrarily - we extend the class by allowing $n$ to take any real value. Then,
since there is no obvious reason that the alternatives should be limited to power-laws, we
add hypotheses from the exponential family to our class. And so on and so forth, into
an infinite regress that brings us to the conclusion that there is no such thing as a
well-defined class of alternatives that would allow for expressing the quantitative plausibility
of a hypothesis.

Scientific theories can never be justified, or verified. Under certain circumstances a
hypothesis $A$ can be trusted more than a hypothesis $B$ - perhaps because $B$ is contradicted
by certain results of observations, and therefore falsified by them, whereas $A$ is not
falsified; or perhaps because a greater number of predictions can be derived with the help
of $A$ than with the help of $B$; or because the likelihood for observing measured data
$x$, given $A$ is correct, $p(x|AI)$, is higher than the corresponding likelihood $p(x|BI)$,
conditional on $B$ being true. The best we can say about a hypothesis is that up to now it has
been able to show its worth, and that it has been more successful than other hypotheses
although, in principle, it can never be justified or verified; we saw that we cannot even
state its probability for being true.

Some inductive arguments are stronger than others, and some are very strong. But how
much stronger or how strong we cannot express [38]. If predictions made by a theory are
borne out by future observations, then we become more confident of the hypotheses that
led to them; and if the predictions never fail in a vast number of tests, we come eventually
to call them physical laws. On the other hand, if the predictions prove to be wrong, we
have learned that our hypotheses are wrong or incomplete, and from the nature of the
error we may get a clue as to how they might be improved ([2], § 9.16.1, p. 311). But there
is absolutely no guarantee for the corrected theory to be correct.

After all has been said and done, it becomes evident that there is nothing absolute 
about the theory of consistent inference about parameters of sampling distributions. The 
theory does not rest upon a rock-bottom: an inference about a parameter is necessarily 
conditional on a specified model (i.e. on a specified family of sampling distributions) 
whose truth we are never able to prove. The whole structure of the theory rises, as it 
were, above a swamp. It is like a building erected on piles. The piles are driven down 
from above into the swamp, but not down to any natural or given base; and when we 
cease our attempts to drive our piles into a deeper layer, it is not because we have reached 
firm ground. We simply stop when we are satisfied that they are firm enough to carry the 
structure, at least for the time being ([8], § 30, p. 111).

17 Conclusions



This article presents an attempt to formulate a consistent theory for inferring parameters 
of sampling probability distributions. The theory is developed by following general rules, 
referred to as the Cox-Polya-Jaynes Desiderata. We extended the existing applications of 
the Desiderata in order to allow for consistent inferences in the limit of complete prior 
ignorance about the values of the parameters. 

Starting from that limit, the Consistency Theorem (63)-(66) is to be used for assigning
probabilities on the basis of the collected data. The form of the Theorem is very similar
to the form of Bayes' Theorem (25) that is used for updating the assigned probability
distributions, but we stressed an important difference between the two. While in Bayes'
Theorem the prior probability $f(\theta|I)$ represents a distribution of credibility of different
values of the parameter $\theta$ that is based on information $I$ prior to the inclusion of data $x$ in
our inference, $\pi(\theta)$ in the Consistency Theorem is just a consistency factor that depends on
the form of the sampling distribution and is determined in a way that ensures consistent
reasoning. Contrary to prior and posterior probabilities for an inferred parameter $\theta$, $p(\theta|I)$
and $p(\theta|xI)$, to the corresponding pdf's, $f(\theta|I)$ and $f(\theta|xI)$, and to the likelihood for an
observed $x$ given $\theta$, $p(x|\theta I)$, the consistency factors by no means represent any kind of
probability distributions and should not be confused with the ill-defined non-informative
priors. That is, no probabilistic inference is ever to be made on the basis of the form of
the consistency factor alone. When this is recognized, there is absolutely no need for the
factors to possess (or not to possess) any of the properties of the pdf's: we find arguments
for the factors to be either normalizable (see, for example, references ll33ll . lfT6l . § 3.27- 
3.29, pp. 77-78, and Q, § 15.12, p. 488) or non-normalizable (see, for example, ifTSl . 
§ 3.1, p. 121, or f2], § 15.10, p. 485) completely irrelevant. 

The developed theory is only an effective one. We met several examples where the 
prior information and the collected data were too meagre to permit a consistent param- 
eter inference. We saw that, under very general conditions, the remedy is just to collect 
more data relevant to the estimated parameter. Probability theory does not guarantee in 
advance that it will lead us to a consistent answer to every conceivable question. But, on 
second thoughts, this should not be too big a surprise, since we are accustomed to effective
theories in all branches of science outside pure mathematics. Formulating an absolute 
theory of inductive reasoning, based on imperfect information, might just turn out to be 
an overambitious task. 

By giving up the idea of the existence of (self-contradicting) non-informative prob- 
ability distributions, and the illusion of an absolute theory, all paradoxes and inconsis- 
tencies, extensively discussed in the preceding sections, are solved and we arrive at a 
position where we can write down a logically consistent quantitative theory of inference 
about parameters. 

The theory is operational in the sense that it is verifiable from long range conse- 
quences. We saw that all the predictions were automatically calibrated, i.e. that the pre- 
dicted long range relative frequencies coincided with the actual frequencies of occurrence. 
This is a very important feature that allows for a reconciliation between the frequentist
and the Bayesian approaches to statistics, probably the same kind of reconciliation that
Maurice Kendall [39] had in mind: "Neither party can avoid ideas of the other in order to
set up and justify a comprehensive theory." In this way the distinction between the theory
of probability and that of statistical inference might be removed, leaving a logical unity
and simplicity.

Appendix



A Cox's Theorem 

In what follows we present, for the sake of completeness, a proof of Cox's Theorem,
borrowing mainly from the original proof of Richard Cox [10] and from the proofs of
Edwin Jaynes ([2], § 2.1-2.2, pp. 24-35) and Kevin Van Horn [40], but first we prove the
following Lemma:

Lemma 1 Suppose that plausibilities $(A|I)$, $(B|AI)$, $(B|I)$ and $(A|BI)$ can be assigned^
and that they completely and uniquely determine the plausibility $(AB|I)$ for the
logical product $AB$ to be true. Then $(AB|I)$ is to be a function either of $(A|I)$ and
$(B|AI)$ or of $(B|I)$ and $(A|BI)$ only.

Proof. According to the above assumptions, there are fifteen different combinations
of plausibilities $(A|I)$, $(B|AI)$, $(B|I)$ and $(A|BI)$, corresponding to fifteen different
subsets of arguments of a function $H$ from which we might compute $(AB|I)$:



$$t = H(x) = H(u) , \qquad t = H(y) = H(v) , \tag{151}$$

and

$$t = H(x, u) , \tag{152}$$

$$t = H(y, v) , \tag{153}$$

$$t = H(x, y) = H(u, v) , \tag{154}$$

$$t = H(x, v) = H(u, y) , \tag{155}$$

$$t = H(x, y, v) = H(u, v, y) , \tag{156}$$

$$t = H(x, y, u) = H(u, v, x) , \tag{157}$$

$$t = H(x, y, u, v) , \tag{158}$$

where we used abbreviations

$$x = (A|I) , \quad y = (B|AI) , \quad u = (B|I) , \quad v = (A|BI) \quad \text{and} \quad t = (AB|I) .$$
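
The counting behind the list (151)-(158) can be reproduced with a short Python sketch. It is only a bookkeeping aid, with names such as `SWAP` and `classes` introduced here for illustration: it enumerates the fifteen non-empty subsets of the four arguments, pairs each subset with its image under the exchange of $A$ and $B$ (that is, $x \leftrightarrow u$ and $y \leftrightarrow v$), and counts the subsets that remain to be ruled out.

```python
from itertools import combinations

ARGS = ("x", "y", "u", "v")
SWAP = {"x": "u", "u": "x", "y": "v", "v": "y"}   # effect of exchanging A and B

def swapped(subset):
    """Image of a subset of arguments under x <-> u, y <-> v."""
    return frozenset(SWAP[name] for name in subset)

# All non-empty subsets of the four arguments.
subsets = [frozenset(c) for r in range(1, 5) for c in combinations(ARGS, r)]

classes = []            # each class holds a subset together with its swapped image
seen = set()
for s in subsets:
    if s in seen:
        continue
    cls = {s, swapped(s)}
    seen |= cls
    classes.append(cls)

admissible = [frozenset("xy"), frozenset("uv")]   # the two subsets of (154)
rejected = [s for s in subsets if s not in admissible]

print(len(subsets), len(classes), len(rejected))  # 15 9 13
```

The nine classes correspond to the two lines of (151) and the seven expressions (152)-(158); thirteen subsets are left to be excluded.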



^Throughout the proof of Cox's Theorem it is always assumed that all considered degrees of plausibility
can be assigned.

Interchangeability of $x$ and $u$, as well as of $y$ and $v$, in the above expressions is implied
by commutativity of the logical product $AB$, $AB = BA$. We always assume that $H$ is
continuous, as well as differentiable and non-decreasing in all of its arguments. Moreover,
in the case where none of the arguments equals $F$, $H$ should be strictly increasing in
all of its arguments.

The Lemma is then proved by the method of trying out all possible subsets of ar- 
guments and demonstrating that all but two of them inevitably lead to conclusions that 
contradict the basic Desiderata. 

a) In four out of the fifteen possible cases, $H$ is a function of a single variable. Let, for
example, in the case of $t = H(x)$, $A$ be true, e.g. a tautology. Then $AB$ is
equivalent to $B$ (i.e. $t$ is equal to $u$), so we have

$$(B|I) = H(1) = \text{constant} ,$$

regardless of the proposition $B$ and the state of information $I$. By letting $B$ be a tautology
and a false proposition, respectively, the above equation implies $1 = F$, which
evidently contradicts the requirements of Desideratum I. Similarly, an assumption
$A = B$ in the case of $t = H(y)$ leads to the same kind of contradiction, so we
conclude that the plausibility $(AB|I)$ cannot be expressed as a function of a single
variable.

b) In a very similar way we can also rule out the possibility (153). Namely, by choosing
$A = B$ we obtain

$$t = H(1, 1) = \text{constant} .$$

In the case of (155), identical contradictions are obtained by choosing either $A$ or
$B$ to be tautologies.

c) According to the first one of the possibilities (156),

$$t = H(x, y, v) , \tag{159}$$

the plausibility $(A(BC)|I)$ is expressible as

$$(A(BC)|I) = H[x, (BC|AI), (A|BCI)] , \tag{160}$$

where $C$, as long as it is consistent with both $I$ and $B$, is a completely arbitrary
proposition. We can therefore choose $C = A$, implying

$$(A(BC)|I) = t , \quad (BC|AI) = y , \quad (BC|I) = t \quad \text{and} \quad (A|BCI) = 1 ,$$

so that (160) reduces to

$$t = H(x, y, 1) ,$$

which is incompatible with the original assumption (159) about the value of $t$ being
dependent on three independent variables $x$, $y$ and $v$.

d) In the case (158) where $H$ is to be a function of four independent variables $x$, $y$, $u$ and
$v$, the plausibility $(A(BC)|I)$ is expressible as

$$(A(BC)|I) = H[x, (BC|AI), (BC|I), (A|BCI)] . \tag{161}$$

By setting $C = A$, (161) reduces to

$$t = H(x, y, t, 1) , \tag{162}$$

or, equivalently, to

$$H(x, y, u, v) = H[x, y, H(x, y, u, v), 1] . \tag{163}$$

Note that due to the interchangeability of $x$ and $u$ and of $y$ and $v$, (162) can be
rewritten as

$$t = H(t, 1, x, y) . \tag{164}$$

Differentiation of (163) with respect to $x$, $y$, $u$ and $v$, respectively, yields a system
of four differential equations:

$$\begin{aligned}
H_1(x, y, u, v) &= H_3[x, y, H(x, y, u, v), 1]\, H_1(x, y, u, v) + H_1[x, y, H(x, y, u, v), 1] , \\
H_2(x, y, u, v) &= H_3[x, y, H(x, y, u, v), 1]\, H_2(x, y, u, v) + H_2[x, y, H(x, y, u, v), 1] , \\
H_3(x, y, u, v) &= H_3[x, y, H(x, y, u, v), 1]\, H_3(x, y, u, v) , \\
H_4(x, y, u, v) &= H_3[x, y, H(x, y, u, v), 1]\, H_4(x, y, u, v) .
\end{aligned} \tag{165}$$

Since $H(x, y, u, v)$ is to be increasing in all of its arguments, the derivatives
$H_3(x, y, u, v)$ and $H_4(x, y, u, v)$ must be positive (i.e. different from zero) in the
case of $x$, $y$, $u$ and $v$ all being different from $F$. Then, according to the latter two of
the four equations, the derivative $H_3[x, y, H(x, y, u, v), 1]$ should equal unity,

\[ H_3(x, y, t, 1) = 1 . \tag{166} \]

This, when inserted into the first equation of (165), further implies

\[ H_1[x, y, H(x, y, u, v), 1] = H_1(x, y, t, 1) = 0 . \tag{167} \]

Since both (166) and (167) must hold for arbitrary propositions $A$ and $B$, they must
also hold for $B$ being a tautology, implying

\[ t = (AB|I) = (A|I) = x \quad \text{and} \quad y = (B|AI) = 1 . \]

In this case (166) and (167) read

\[ H_1(x, 1, x, 1) = 0 \quad \text{and} \quad H_3(x, 1, x, 1) = 1 . \tag{168} \]

The same kind of reasoning as above, however, applied to equation (164), leads to

\[ H_1(x, 1, x, 1) = 1 \quad \text{and} \quad H_3(x, 1, x, 1) = 0 , \]

which, when compared to (168), is an evident inconsistency. In this way the possibility
for $t$ to be a function of four independent variables $x$, $y$, $u$ and $v$ is finally
rejected.

Note that three additional possibilities for $H$, (157) and (152), can be excluded by
following the same patterns of reasoning as in the case of $H(x, y, u, v)$.

After trying out thirteen different possibilities we thus end up with only two admissible
subsets of arguments for $H$: $x$ and $y$, or $u$ and $v$, the latter being a consequence
of the aforementioned interchangeability of variables. All other subsets, (151), (152),
(153), (155), (156), (157) and (158), have been ruled out as incompatible with the basic
Desiderata. Therefore, if $(AB|I)$ is truly to be uniquely and completely determined by
the plausibilities $(A|I)$, $(B|AI)$, $(B|I)$ and $(A|BI)$, there must exist a function $H$ (154)
such that

\[ (AB|I) = H[(A|I), (B|AI)] = H[(B|I), (A|BI)] , \]

which completes the proof of the Lemma.

In summary, existence of the function $H$ (154) thus represents the starting point of the
proof of Cox's Theorem. Recall that Desideratum II requires $H$ to be strictly increasing
and twice differentiable in both of its arguments, so we have

\[ H_1(x, y) \geq 0 \quad \text{and} \quad H_2(x, y) \geq 0 , \]

with equalities if and only if $y$ and $x$ represent impossibilities, respectively.

Suppose now we try to find the plausibility $(ABC|I)$ that three propositions, $A$,
$B$ and $C$, would be true simultaneously. Because of the fact that Boolean algebra is associative,
$ABC = (AB)C = A(BC)$, we can express the plausibility that we are searching
for in two different ways,

\[ ((AB)C|I) = H[(AB|I), (C|ABI)] = H[H(x, y), z] \]

and

\[ (A(BC)|I) = H[(A|I), (BC|AI)] = H[x, H(y, z)] , \]

where another abbreviation, $z \equiv (C|ABI)$, was used. According to Desideratum III.a,
the two ways must lead to the same result, i.e. if our reasoning is to be consistent, function
$H$ must solve the Associativity Equation:

\[ H[H(x, y), z] = H[x, H(y, z)] \tag{169} \]

(see, for example, reference [22], § 6.2, pp. 253-273, and § 7.2.2, pp. 327-330, and references
quoted therein).

In order to solve it, we first differentiate the Associativity Equation with respect to $x$
and $y$, obtaining in this way:

\[ H_1[H(x, y), z] \, H_1(x, y) = H_1[x, H(y, z)] \tag{170} \]

and

\[ H_1[H(x, y), z] \, H_2(x, y) = H_2[x, H(y, z)] \, H_1(y, z) . \tag{171} \]

Dividing (171) by (170) yields:

\[ K(x, y) = K[x, H(y, z)] \, H_1(y, z) , \tag{172} \]

where

\[ K(x, y) \equiv \frac{H_2(x, y)}{H_1(x, y)} . \]
Equation (172) can also be rewritten as:

\[ K(x, y) \, K(y, z) = K[x, H(y, z)] \, H_2(y, z) . \tag{173} \]

The right side of (172) is independent of $z$, therefore when differentiated with respect to
$z$, it vanishes for all $x$, $y$ and $z$:

\[ K_2[x, H(y, z)] \, H_1(y, z) \, H_2(y, z) + K[x, H(y, z)] \, H_{12}(y, z) = 0 . \tag{174} \]

The right side of (173), when differentiated with respect to $y$, equals the left side of (174), and must
therefore vanish for all $x$, $y$ and $z$, too. It is then evident that both the left and the right
side of eq. (173) must be independent of $y$, i.e. this means that

\[ \frac{\partial}{\partial y} \ln \big( K(x, y) \, K(y, z) \big) = \frac{\partial}{\partial y} \ln K(x, y) + \frac{\partial}{\partial y} \ln K(y, z) = 0 \]

or, equivalently,

\[ \frac{\partial}{\partial y} \ln K(x, y) = - \frac{\partial}{\partial y} \ln K(y, z) . \]

This further implies both the right and the left side of the above equation to be independent
of either $x$ or $z$:

\[ \frac{\partial}{\partial y} \ln K(x, y) = - \frac{\partial}{\partial y} \ln K(y, z) \equiv \frac{d}{dy} \ln h'(y) , \tag{175} \]

where

\[ h'(y) \equiv \frac{d}{dy} h(y) \]

is a strictly positive function of $y$. Permutation of the variables results in an expression
very much like (175):

\[ \frac{\partial}{\partial x} \ln K(z, x) = - \frac{\partial}{\partial x} \ln K(x, y) = \frac{d}{dx} \ln h'(x) . \tag{176} \]

By subtracting equation (176), multiplied by $dx$, from equation (175), multiplied by $dy$,
we obtain:

\[ d \ln K(x, y) = d \ln \frac{h'(y)}{h'(x)} \]

or, when integrated,

\[ K(x, y) = \frac{H_2(x, y)}{H_1(x, y)} = \frac{1}{a} \, \frac{h'(y)}{h'(x)} , \tag{177} \]

where $a$ is an arbitrary integration constant. By introducing

\[ G(x, y) \equiv a \, h(x) + h(y) , \]

equation (177) can be rewritten as:

\[ H_1(x, y) \, G_2(x, y) - H_2(x, y) \, G_1(x, y) = 0 . \]

We saw on two previous occasions (recall Section|5l eqns. (I50H53I) . and the repeated situ- 
ation in SectionfT^. that the general solution of such a functional equation reads: 



\[ H(x, y) = k\big(G(x, y)\big) = k\big(a \, h(x) + h(y)\big) , \tag{178} \]



where $k$ is an arbitrary function of a single variable $G(x, y)$. When inserted into the
Associativity Equation, the solution (178) yields:

\[ k\Big\{ a \, h\big[ k\big(a \, h(x) + h(y)\big) \big] + h(z) \Big\} = k\Big\{ a \, h(x) + h\big[ k\big(a \, h(y) + h(z)\big) \big] \Big\} . \tag{179} \]

If the above equality is to be true for every $z$, then the function $h[k(a \, h(y) + h(z))]$ must
take the form:

\[ h\big[ k\big(a \, h(y) + h(z)\big) \big] = l(y) + h(z) , \]

or, equivalently, the function $k(a \, h(y) + h(z))$ must take the form:

\[ k\big(a \, h(y) + h(z)\big) = h^{-1}\big(l(y) + h(z)\big) , \tag{180} \]

where $h^{-1}$ is the inverse function of $h$, while $l(y)$ is a function of $y$ whose form we are
about to determine. In order to do that, we first differentiate equation (180) with respect
to $z$ and obtain:

\[ k'\big(a \, h(y) + h(z)\big) \, h'(z) = (h^{-1})'\big(l(y) + h(z)\big) \, h'(z) , \]

thus implying equality between $k'$ and $(h^{-1})'$:

\[ k'\big(a \, h(y) + h(z)\big) = (h^{-1})'\big(l(y) + h(z)\big) . \]

Taking this equality into account, we also differentiate (180) with respect to $y$ and obtain
a differential equation:

\[ a \, h'(y) = l'(y) , \]

whose integral reads:

\[ l(y) = a \, h(y) + b , \tag{181} \]

where $b$ is an integration constant. This then implies the form of $H(x, y)$ to be:

\[ H(x, y) = h^{-1}\big(a \, h(x) + h(y) + b\big) . \tag{182} \]

To determine the value of the constant $a$, let us insert the solution (182) into (179),
obtaining in this way an equation:

\[ a \big( a \, h(x) + h(y) + b \big) + h(z) = a \, h(x) + a \, h(y) + h(z) + b , \]

or

\[ \big( a \, h(x) + b \big) (a - 1) = 0 , \]

with two possible solutions: $a = b = 0$ or $a = 1$. Since the former violates the requirement
for monotonicity of $H(x, y)$, $H_1(x, y) > 0$, the only acceptable solution of the
Associativity Equation reads:

\[ H(x, y) = h^{-1}\big( h(x) + h(y) + b \big) , \]

or, written in terms of plausibilities:

\[ h[(AB|I)] = h[(A|I)] + h[(B|AI)] + b . \tag{183} \]

By exponentiation, the solution takes the form

\[ w(AB|I) = w(A|I) \, w(B|AI) \, e^{b} , \tag{184} \]

with

\[ w(A|I) \equiv w[(A|I)] \equiv \exp\{ h[(A|I)] \} \]

being a function of plausibilities that is by construction both positive and strictly increasing
with respect to its argument.
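
The role of the constant $a$ in (182) can be illustrated numerically. The following is only a minimal sketch: the choice $h = \ln$ (so that $H(x, y) = x^{a} y \, e^{b}$), the helper names `make_H` and `associativity_defect`, and the test point are illustrative assumptions, not part of the derivation above. It evaluates the difference between the two sides of the Associativity Equation (169) for several pairs $(a, b)$.

```python
import math

def make_H(a, b, h=math.log, h_inv=math.exp):
    """Candidate solution H(x, y) = h^{-1}(a*h(x) + h(y) + b), cf. eq. (182).

    h = ln is a hypothetical choice made only for this illustration.
    """
    def H(x, y):
        return h_inv(a * h(x) + h(y) + b)
    return H

def associativity_defect(H, x, y, z):
    """Left side minus right side of the Associativity Equation (169)."""
    return H(H(x, y), z) - H(x, H(y, z))

if __name__ == "__main__":
    x, y, z = 0.3, 0.5, 0.8
    # a = 1 should give a vanishing defect for any b; a != 1 should not.
    for a, b in [(1.0, 0.0), (1.0, -0.2), (2.0, 0.0), (0.5, -0.2)]:
        defect = associativity_defect(make_H(a, b), x, y, z)
        print(f"a = {a}, b = {b}: defect = {defect:.3e}")
```

For $a = 1$ the defect is zero up to rounding for any $b$, while for $a \neq 1$ it is not, in line with the condition $(a \, h(x) + b)(a - 1) = 0$.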

Suppose now that given information $I$, a proposition $A$ is certain, i.e. true beyond
any reasonable doubt, and that $B$ is another proposition. Then, the state of knowledge
about the propositions $A$ and $B$ being simultaneously true, $AB$, is the same as the state
of knowledge about only $B$ being true, which can be expressed by a simple equation of
Boolean algebra as:

\[ AB = B . \]

Therefore, by Desideratum III.c, we must assign equal plausibilities for $AB$ and $B$,

\[ (AB|I) = (B|I) , \]

and we also will have

\[ (A|BI) = (A|I) \]

because if $A$ is already certain given $I$, then given any other information $B$ not contradicting
$I$, it remains certain. In this case the product rule reads:

\[ w(B|I) = w(A|I) \, w(B|I) \, e^{b} , \]

so our function $w(x)$ must have a property that for certain events $A$

\[ w(A|I) = e^{-b} . \]

With no loss of generality we may choose the value of the constant $b$ to be zero, i.e. the
value of function $w(A|I)$ for a certain event $A$ to be one. In other words, any continuous
positive strictly increasing function $w$ with an upper bound can be renormalized by being
multiplied with $e^{b}$ so that its upper bound equals unity. Note that the composition of the
plausibility assignment $(A|I)$ and the renormalized function $w$, $w(A|I)$, meets all the
requirements for plausibilities, i.e. renormalized functions $w(A|I)$ are also plausibilities
by themselves. Then, our general product rule takes the form:

\[ w(AB|I) = w(A|I) \, w(B|AI) = w(B|I) \, w(A|BI) , \tag{185} \]

where in the case that, apart from $(A|I)$ and $(B|AI)$, also plausibilities $(B|I)$ and $(A|BI)$
can be assigned, the last equality is due to Desideratum III.a. Evidently, the product rule
can also be rewritten as

\[ w^{a}(AB|I) = w^{a}(A|I) \, w^{a}(B|AI) = w^{a}(A|BI) \, w^{a}(B|I) , \tag{186} \]

where $a$ is a non-zero but otherwise arbitrary constant.

Now suppose that $A$ is impossible, given $I$. Then, the proposition $AB$ is also impossible
given $I$:

\[ w(AB|I) = w(A|I) , \]

and if $A$ is already impossible given $I$, then, given any further information $B$ which does
not contradict $I$, $A$ would still be impossible:

\[ w(A|BI) = w(A|I) . \]

In this case, the product rule reduces to

\[ w(A|I) = w(A|I) \, w(B|I) , \]

which must hold regardless of the plausibility for $B$, given $I$. There are three possible values
of $w(A|I)$ that could satisfy this condition: it could be either $-\infty$, $+\infty$ or zero. The choice
$-\infty$ is ruled out due to the requirement for plausibilities to take non-negative values,
while $+\infty$ contradicts the requirement for plausibilities to be monotonically increasing,
both thus implying the plausibility for impossible events uniquely to be zero.

Now, in order to derive the sum rule, we suppose that the plausibility for proposition
$A$ to be false, given information $I$, $w(\bar{A}|I)$, must depend in some way on the plausibility
$w(A|I)$ that it is true, i.e. there must exist some functional relation

\[ w(\bar{A}|I) = S[w(A|I)] . \]

Qualitative correspondence with common sense requires that $S[w(A|I)]$ be a continuous,
twice differentiable, strictly decreasing function with the extreme values

\[ S(0) = 1 \quad \text{and} \quad S(1) = 0 . \tag{187} \]



But it cannot be just any function with these properties, for it must be fully consistent
with the product rule. We make use of the latter and express the plausibility for $A$ and
$B$ simultaneously to be true as:

\[ w(AB|I) = w(A|I) \, w(B|AI) = w(A|I) \, S[w(\bar{B}|AI)] = w(A|I) \, S\!\left[ \frac{w(A\bar{B}|I)}{w(A|I)} \right] . \]

We also invoke consistency of the product rule, so that:

\[ w(A|I) \, S\!\left[ \frac{w(A\bar{B}|I)}{w(A|I)} \right] = w(B|I) \, S\!\left[ \frac{w(\bar{A}B|I)}{w(B|I)} \right] . \tag{188} \]

Since this must hold for every $A$ and $B$, given $I$, it must also hold when

\[ B = \bar{A} + \bar{C} , \]

i.e. when

\[ \bar{B} = AC , \tag{189} \]

where $C$ is any new proposition. But then, according to simple results of Boolean algebra:

\[ A\bar{B} = \bar{B} \quad \text{and} \quad \bar{A}B = \bar{A} , \]

and by using the abbreviations

\[ x \equiv w(A|I) \quad \text{and} \quad y \equiv w(B|I) , \]

(188) becomes a functional equation

\[ x \, S\!\left[ \frac{S(y)}{x} \right] = y \, S\!\left[ \frac{S(x)}{y} \right] . \]

By defining new variables,

\[ u \equiv \frac{S(y)}{x} \quad \text{and} \quad v \equiv \frac{S(x)}{y} , \]

the functional equation is further reduced to

\[ x \, S(u) = y \, S(v) . \tag{190} \]



On the way to the solution, we differentiate (190) with respect to $x$, $y$, and $x$ and $y$, respectively,
obtaining in this way:

\[ S(u) - u \, S'(u) = S'(v) \, S'(x) , \tag{191} \]

\[ S'(u) \, S'(y) = S(v) - v \, S'(v) , \tag{192} \]

and

\[ S''(u) \, S'(y) \, \frac{u}{x} = S''(v) \, S'(x) \, \frac{v}{y} . \tag{193} \]

In order to eliminate $x$ and $y$, we multiply equations (190) and (193),

\[ S''(u) \, S(u) \, S'(y) \, u = S''(v) \, S(v) \, S'(x) \, v , \]

and express $S'(x)$ and $S'(y)$ from (191) and (192), arriving in this way at:

\[ \frac{S''(u) \, S(u) \, u}{S'(u) \, [S(u) - u \, S'(u)]} = \frac{S''(v) \, S(v) \, v}{S'(v) \, [S(v) - v \, S'(v)]} . \]

Then, evidently both sides of the above equation must equal a constant, say $a - 1$, and the
above equation splits into two identical differential equations of the form:

\[ \frac{dS'}{S'} = (a - 1) \left( \frac{du}{u} - \frac{dS}{S} \right) . \]

The solution that is obtained by two successive integrations and that satisfies the boundary
conditions (187) reads:

\[ S(u) = (1 - u^{a})^{1/a} . \tag{194} \]

In this way we obtained the so-called sum rule:

\[ w^{a}(A|I) + w^{a}(\bar{A}|I) = 1 . \tag{195} \]



Since our derivation of the functional equation (190) used the special choice (189) for
$B$, (194) is a necessary condition to satisfy the general consistency requirement (188).
To check for its sufficiency, we substitute (194) in (188) and obtain an evident equality
(c.f. eq. (185)):

\[ w(A|I) \, w(B|AI) = w(B|I) \, w(A|BI) . \]

Therefore, equation (194) is the necessary and sufficient condition on $S(x)$ for consistency
in the sense (188).
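
That the family (194) satisfies the functional equation obtained from (188) with the special choice (189), namely $x \, S[S(y)/x] = y \, S[S(x)/y]$, can also be checked numerically. The sketch below is illustrative only: the function names, the test point and the counter-example $S(u) = 1 - u^2$ (which meets the boundary conditions (187) but is not of the form (194)) are assumptions introduced here, and the point is chosen so that both arguments $S(y)/x$ and $S(x)/y$ stay within $[0, 1]$.

```python
def S(u, a):
    """Candidate solution S(u) = (1 - u**a)**(1/a), cf. eq. (194)."""
    return (1.0 - u ** a) ** (1.0 / a)

def S_bad(u):
    """Decreasing function with S(0) = 1 and S(1) = 0 that is not of the form (194)."""
    return 1.0 - u * u

def consistency_defect(S_func, x, y):
    """x*S[S(y)/x] - y*S[S(x)/y]; zero when the functional equation holds."""
    return x * S_func(S_func(y) / x) - y * S_func(S_func(x) / y)

if __name__ == "__main__":
    x, y = 0.8, 0.7  # hypothetical plausibilities with S(y) <= x and S(x) <= y
    for a in (0.5, 1.0, 2.0):
        d = consistency_defect(lambda u, a=a: S(u, a), x, y)
        print(f"a = {a}: defect = {d:.3e}")
    print(f"S_bad: defect = {consistency_defect(S_bad, x, y):.3e}")
```

For every tested $a$ the defect vanishes up to rounding, whereas the counter-example leaves a clearly non-zero defect.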

Out of all possible plausibility functions $w(A|I)$ we then choose the probability $P(A|I)$,

\[ P(A|I) \equiv w^{a}(A|I) , \]

for which the product and the sum rule evidently take the forms:

\[ P(AB|I) = P(A|I) \, P(B|AI) = P(B|I) \, P(A|BI) \]

and

\[ P(A|I) + P(\bar{A}|I) = 1 . \]

This completes the proof of Cox's Theorem.

B More on the marginalization paradox



In 1972, Stone and Dawid published an article [32] containing two examples of the so-called
marginalization paradox. The examples refer to inferences that could be made in
two different ways leading to two different results, despite the two possible ways being
completely equivalent. Since at least one of the two chosen ways involves the procedure
of marginalization (23), such inconsistencies are usually referred to as the marginalization
paradox. Stone and Dawid judged the usage of improper (i.e. non-integrable) non-informative
priors as the cause of the paradox.

If correct, these arguments would seriously threaten the consistency of probability 
theory, for we saw that our consistency factors for location, scale and dispersion parameters,
(110) and (114), despite being conceptually different from non-informative priors,
are all non-integrable over their particular (infinite) domains. But since the form of the 
consistency factors is uniquely determined by the Cox-Polya-Jaynes Desiderata (see Sec- 
tionfTTTl. the arguments of Stone and Dawid, if correct, would imply that it is impossible 
to construct a consistent and calibrated theory of qualitative inference about parameters 
of sampling distributions. 

If the appropriate consistency factors (110) and (114) are used, there is no paradox in
the first of the two aforementioned examples. The origin of the paradox in Example #2,
however, is better camouflaged and is discussed in the following.

Let the data $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ consist of $n > 2$ observations from the normal
sampling distribution $N(\mu, \sigma)$. Suppose that before the data $\mathbf{x}$ were collected we had been
completely ignorant about the values of the parameters and that we are not interested in
either of the two parameters separately but only in their ratio $\theta$:

\[ \theta \equiv \frac{\mu}{\sigma} . \]



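Let the inference be made by two persons, say Mr. A and Mr. B, each choosing his
own way to obtain the probability distribution for $\theta$. Mr. A strictly obeys the rules of
probability as developed in the present paper and makes use of the Consistency Theorem,
of the consistency factor (114), and of the sequential use of Bayes' Theorem
in order to obtain first a two-dimensional pdf for $\mu$ and $\sigma$:

\[ f(\mu\sigma|\mathbf{x}I) \propto \frac{1}{\sigma^{n+1}} \exp\left\{ -\frac{n}{2\sigma^2} \left[ (\bar{x} - \mu)^2 + s^2 \right] \right\} , \tag{196} \]

where:

\[ \bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i , \quad s^2 \equiv \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

and

\[ R^2 \equiv n \, (\bar{x}^2 + s^2) , \quad r \equiv \frac{n \bar{x}}{R} . \]

The admissible range of the variable $\bar{x}$ is the entire real axis, for $s^2$ and $R^2$ it is only
its positive half, while the range of $r$ is the interval $(-\sqrt{n}, \sqrt{n})$.
By multiplying the above pdf by the appropriate Jacobian,

\[ \left| \frac{\partial(\mu, \sigma)}{\partial(\theta, \sigma)} \right| = \sigma , \]

Mr. A obtains the pdf for $\theta$ and $\sigma$,

\[ f(\theta\sigma|\mathbf{x}I) \propto \frac{1}{\sigma^{n}} \exp\left\{ -\frac{n\theta^2}{2} - \frac{R^2}{2\sigma^2} + \frac{r R \theta}{\sigma} \right\} . \]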



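Since he is interested only in $\theta$ but not in $\sigma$, he integrates out the latter in order to obtain
the marginal pdf for $\theta$,

\[ f(\theta|\mathbf{x}I) = \int_0^\infty f(\theta\sigma|\mathbf{x}I) \, d\sigma \propto e^{-n\theta^2/2} \, H_{n-2}(r, \theta) , \]

where function $H_n(r, \theta)$ is introduced as:

\[ H_n(r, \theta) \equiv \int_0^\infty u^{n} \exp\left\{ -\frac{u^2}{2} + r \theta u \right\} du . \tag{197} \]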



Evidently, Mr. A's pdf for $\theta$ is a function of the statistic $r$ alone and, when also appropriately
normalized, it reads:

\[ f(\theta|\mathbf{x}I) = \frac{e^{-n\theta^2/2} \, H_{n-2}(r, \theta)}{\eta(r)} , \tag{198} \]

where $\eta(r)$ is the usual normalization factor.

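Seeing Mr. A's result, Mr. B decides to simplify the calculation. His intuitive judgment
is that he ought to be able to derive the (marginal) pdf for $\theta$ equivalently but more easily by
a direct application of the Consistency Theorem in a reduced model, i.e. that $\theta$ should be
inferred directly from the sampling distribution for $r$ only, following in this way the so-called
reduction principle [33]. He starts with the appropriate pdf for $\bar{x}$ and $s^2$,

\[ f(\bar{x} s^2|\theta\sigma I') \propto \sigma^{-n} s^{n-3} \exp\left\{ -\frac{n}{2\sigma^2} \left[ (\bar{x} - \theta\sigma)^2 + s^2 \right] \right\} , \]

transforms it into the pdf for $r$ and $R$,

\[ f(rR|\theta\sigma I') = f(\bar{x} s^2|\theta\sigma I') \left| \frac{\partial(\bar{x}, s^2)}{\partial(r, R)} \right| \propto \frac{R^{n-1}}{\sigma^{n}} \, (n - r^2)^{\frac{n-3}{2}} \exp\left\{ -\frac{n\theta^2}{2} - \frac{R^2}{2\sigma^2} + \frac{r R \theta}{\sigma} \right\} , \tag{199} \]

and then reduces it by means of marginalization to a pdf for $r$ only:

\[ f(r|\theta\sigma I') = \int_0^\infty f(rR|\theta\sigma I') \, dR \propto (n - r^2)^{\frac{n-3}{2}} \, e^{-n\theta^2/2} \, H_{n-1}(r, \theta) . \tag{200} \]

Note that the pdf for $r$ is independent of the parameter $\sigma$, i.e. $f(r|\theta\sigma I') = f(r|\theta I')$. Then,
according to the Consistency Theorem, Mr. B's pdf for $\theta$ reads:

\[ f(\theta|rI') \propto \pi(\theta) \, f(r|\theta I') \quad \Longrightarrow \quad f(\theta|rI') = \frac{\pi(\theta)}{\tilde{\eta}(r)} \, e^{-n\theta^2/2} \, H_{n-1}(r, \theta) , \tag{201} \]

with $\pi(\theta)$ and $\tilde{\eta}(r)$ being the appropriate consistency and normalization factors, the latter
also containing the factor $(n - r^2)^{\frac{n-3}{2}}$ from equation (200).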

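Now, if the two marginal pdf's for $\theta$ of Mr. A and Mr. B, (198) and (201), are to be
equal, there must exist a consistency factor $\pi(\theta)$ such that

\[ h(r) \, \pi(\theta) \, H_{n-1}(r, \theta) = H_{n-2}(r, \theta) \tag{202} \]

for all real $r$, $\theta \in (-\infty, \infty)$ and for every integer $n > 2$, where

\[ h(r) \equiv \frac{\eta(r)}{\tilde{\eta}(r)} \]

is the ratio of the normalization factors of Mr. A and Mr. B.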



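But it is easy to demonstrate by reductio ad absurdum that a general solution $\pi$ of equation
(202) does not exist. For a moment we therefore suppose that such a solution indeed
exists. By differentiating the logarithm of (202) with respect to $r$ (with respect to $\theta$) we
obtain

\[ \frac{h'(r)}{h(r)} = \theta \left( \frac{H_{n-1}(r, \theta)}{H_{n-2}(r, \theta)} - \frac{H_{n}(r, \theta)}{H_{n-1}(r, \theta)} \right) \tag{203} \]

and

\[ \frac{\pi'(\theta)}{\pi(\theta)} = r \left( \frac{H_{n-1}(r, \theta)}{H_{n-2}(r, \theta)} - \frac{H_{n}(r, \theta)}{H_{n-1}(r, \theta)} \right) , \tag{204} \]

where we made use of the following properties of functions $H_n$:

\[ \frac{\partial}{\partial r} H_n(r, \theta) = \theta \, H_{n+1}(r, \theta) \quad \text{and} \quad \frac{\partial}{\partial \theta} H_n(r, \theta) = r \, H_{n+1}(r, \theta) . \]

When divided, equations (203) and (204) yield a functional equation:

\[ r \, \frac{h'(r)}{h(r)} = \theta \, \frac{\pi'(\theta)}{\pi(\theta)} , \]

whose general solution are functions $h(r)$ and $\pi(\theta)$ of the form

\[ h(r) \propto r^{c} \quad \text{and} \quad \pi(\theta) \propto \theta^{c} , \]

with $c$ being a real constant that could be either positive, negative or exactly zero. When
inserted in equation (202), functions $h(r)$ and $\pi(\theta)$ imply:

\[ H_{n-2}(r, \theta) = C \, r^{c} \, \theta^{c} \, H_{n-1}(r, \theta) ; \quad c > 0 , \tag{205} \]

\[ H_{n-1}(r, \theta) = C \, r^{|c|} \, \theta^{|c|} \, H_{n-2}(r, \theta) ; \quad c < 0 , \tag{206} \]

\[ H_{n-2}(r, \theta) = C \, H_{n-1}(r, \theta) ; \quad c = 0 , \tag{207} \]

where $C$ is an arbitrary constant.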

If this is to be true for any value of $\theta$ within its permissible range, it must also be
true for $\theta = 0$. But then, for positive $c$, this implies $H_{n-2}(r, \theta = 0) = 0$, which, when
compared to $H_{n-2}(r, \theta = 0) = 2^{(n-3)/2} \, \Gamma\big(\frac{n-1}{2}\big)$ that follows directly from the definition of
$H_n$ (197), is a clear contradiction. For negative $c$ we derive a contradiction in an identical
way. For $r \neq 0$, the result of differentiation with respect to $\theta$ of equation (207), which
corresponds to the value of $c$ being exactly zero, reads:

\[ H_{n-1}(r \neq 0, \theta) = C \, H_{n}(r \neq 0, \theta) . \tag{208} \]

Dividing (207) with (208) and again setting $\theta = 0$ yields:

\[ \Gamma\!\left( \frac{n-1}{2} \right) \Gamma\!\left( \frac{n+1}{2} \right) = \Gamma^2\!\left( \frac{n}{2} \right) , \]

which is, since it is to be valid for any integer $n > 2$, an evident contradiction. Hence, the
proof is completed.
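
The ingredients of the above proof can be checked numerically. The following minimal sketch evaluates $H_n(r, \theta)$ of (197) with a composite Simpson rule and compares it with the closed form $H_n(r, 0) = 2^{(n-1)/2} \, \Gamma\big(\frac{n+1}{2}\big)$, verifies the derivative property $\partial H_n / \partial r = \theta H_{n+1}$ by a central finite difference, and evaluates the gap between the two sides of the Gamma-function condition at $\theta = 0$. The function names, the integration cut-off and step count, and the test values of $r$ and $\theta$ are assumptions made only for this illustration.

```python
import math

def H(n, r, theta, upper=40.0, steps=8000):
    """Composite Simpson approximation of H_n(r, theta), eq. (197).

    The infinite upper limit is replaced by `upper` (an assumed cut-off);
    `steps` must be even.
    """
    h = upper / steps
    def f(u):
        return u ** n * math.exp(-0.5 * u * u + r * theta * u)
    total = f(0.0) + f(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3.0

def H_at_zero(n):
    """Closed form of H_n(r, theta = 0)."""
    return 2.0 ** ((n - 1) / 2.0) * math.gamma((n + 1) / 2.0)

def gamma_gap(n):
    """Gamma((n-1)/2)*Gamma((n+1)/2) - Gamma(n/2)**2, defined for n >= 2."""
    return math.gamma((n - 1) / 2) * math.gamma((n + 1) / 2) - math.gamma(n / 2) ** 2

if __name__ == "__main__":
    for n in range(0, 5):
        print(n, H(n, 0.5, 0.0), H_at_zero(n))
    r, theta, eps = 1.2, 0.7, 1e-4
    dH_dr = (H(3, r + eps, theta) - H(3, r - eps, theta)) / (2 * eps)
    print(dH_dr, theta * H(4, r, theta))
    for n in range(2, 8):
        print(n, gamma_gap(n))
```

The gap is strictly positive for every tested $n$, so the condition obtained from (207) and (208) at $\theta = 0$ cannot be met.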

Stone and Dawid in 0^ . as well as Dawid, Stone and Zidek in 0^ . considered it 
obvious that Mr. A and Mr. B made their inferences about $\theta$ from the same information
(i.e. from the measured value of $r$) and should therefore indeed come to the same conclusions:
the pdf's for $\theta$, (198) and (201), ought to be identical. The fact that Mr. B cannot
reproduce Mr. A's result whatever consistency factor $\pi(\theta)$ he chooses served them as a
proof that either Mr. A or Mr. B must be guilty of some transgression. According to their
reasoning, the blame is to be put on Mr. A for using a non-integrable consistency factor
$\pi(\mu, \sigma)$.

From the standpoint that we advocate in the present paper, it is evident that such rea- 
soning is unjustified. First, as we have stressed in many places and especially in Sections|5] 
and 121 in the limit of complete prior ignorance about the value of an inferred parameter, 
information about the parameter of a sampling distribution after recording events from 
that particular distribution consists of two pieces: of the value of the likelihood and of 
the form of the sampling distribution (i.e. of the specified model). But the model $I$ (196)
of Mr. A is different from the model $I'$ (200) of Mr. B. In particular, Mr. B obtained the
reduced model (200) by marginalization of the sampling distribution (199). Since every
marginalization represents an irreversible process (a reverse transformation from the reduced
model to the original one does not exist), Mr. B is deliberately throwing away available
information, acting in this way against the consistency Desideratum III.b. Therefore,
Mr. A and Mr. B do not infer parameter $\theta$ from the same information so it is natural that
they come to different conclusions (see also ref. [2], § 15.8, pp. 472-473). The fact that
the inferences of Mr. A and Mr. B will never coincide is thus an unequivocal proof that
some of the cogent information was lost during Mr. B's reduction of his model. Moreover,
since there is one and only one logically consistent procedure (i.e. a procedure that
is in complete accordance with the basic Desiderata) to infer $\theta$ and this is the procedure
applied by Mr. A, and since the reduction of the sampling distribution put Mr. B in
a position where he cannot find the appropriate consistency factor that would reproduce
Mr. A's result, Mr. B is no longer able to make any kind of consistent inference about the
parameter $\theta$.

Second, Mr. A is perfectly capable of making predictions about the value of the inferred
parameter that are verifiable by their long-run consequences. We performed numerous
Monte Carlo experiments and inferred the value $\theta$ of the ratio of parameters $\mu$ and $\sigma$ of
the generator of normally distributed random numbers. As expected, confidence intervals
based on Mr. A's marginal distribution (198) for $\theta$ covered the true value (i.e. the ratio
of the parameters that were actually used by the generator) exactly in the percentage $\delta$
of cases that was predicted according to (125). The coverage was exact regardless of the
number $n$ of observations that we based our inference upon each time, the chosen value
$\delta$, and the type of chosen confidence interval (it could have equally well been the shortest
of all possible intervals, the central interval, the lower-most or the upper-most interval, or
any other interval with the chosen probability (125) that equals $\delta$). That is, the predictions
of Mr. A are calibrated. On the other hand, the predictions of Mr. B, who is not able to
reproduce the predictions of Mr. A, will thus necessarily be non-calibrated.
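
A Monte Carlo experiment of the kind described above might be sketched as follows. This is not the authors' code but a minimal illustration under a stated assumption: the consistency factor for $(\mu, \sigma)$ is taken to be proportional to $1/\sigma$, which is the form that leads to $H_{n-2}$ in (198). Under that assumption $n s^2 / \sigma^2$ follows a $\chi^2$ distribution with $n - 1$ degrees of freedom and $\mu$ given $\sigma$ is normal with mean $\bar{x}$ and standard deviation $\sigma / \sqrt{n}$, so the distribution of $\theta = \mu / \sigma$ can be sampled directly instead of integrating (198). All function names and numerical settings are hypothetical.

```python
import math
import random

def posterior_cdf_at(theta0, data, rng, draws=400):
    """Monte Carlo estimate of Mr. A's distribution function for theta at theta0.

    Assumes a consistency factor proportional to 1/sigma (an assumption of
    this sketch), so that theta = mu/sigma can be sampled directly.
    """
    n = len(data)
    xbar = sum(data) / n
    ns2 = sum((x - xbar) ** 2 for x in data)  # n * s^2
    below = 0
    for _ in range(draws):
        # chi-square variate with n - 1 degrees of freedom
        q = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n - 1))
        sigma = math.sqrt(ns2 / q)
        mu = rng.gauss(xbar, sigma / math.sqrt(n))
        if mu / sigma <= theta0:
            below += 1
    return below / draws

def coverage(mu0, sigma0, n, delta, trials, rng):
    """Fraction of simulated data sets whose central interval of content delta
    contains the true ratio mu0/sigma0."""
    lo, hi = (1.0 - delta) / 2.0, (1.0 + delta) / 2.0
    hits = 0
    for _ in range(trials):
        data = [rng.gauss(mu0, sigma0) for _ in range(n)]
        c = posterior_cdf_at(mu0 / sigma0, data, rng)
        if lo <= c <= hi:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    rng = random.Random(1)
    print(coverage(mu0=1.0, sigma0=2.0, n=5, delta=0.9, trials=300, rng=rng))
```

Within the statistical fluctuations of the simulation, the observed coverage should agree with the chosen $\delta$; the number of trials and of draws per trial governs the size of those fluctuations.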

Third, the inference of Mr. A was based only on steps that are all (including the use
of the consistency factor (114) and the procedure of marginalization of the pdf $f(\theta\sigma|\mathbf{x}I)$)
deduced directly (i.e. without any other assumptions) from the basic Desiderata. In particular,
absolutely no assumption was ever made about the existence of an integral of the
consistency factor over its domain: it need not exist, neither is it forbidden to exist. That
is, for Mr. A the existence of the integral is completely irrelevant and we therefore see
no reason whatsoever why the eventual non-integrability of the consistency factor (114)
could be any kind of transgression.

After all is said and done, there are only two possibilities left at Mr. B's disposal. 
He can either correct his intuitive reasoning and abandon the principle of reduction, or 
he can develop a completely new theory of inference on his own by incorporating the 
reduction principle in his set of basic rules. But, as exhibited above, such a theory would 
necessarily be logically inconsistent (e.g. it would allow, among other things, some of the 
available information that is relevant to a particular inference to be ignored), as well as 
non-operational, and thus of no practical importance. 

The basic Desiderata and their direct applications such as the Cox, Bayes and Consistency
Theorems, are adequate for conducting inference so they must always take precedence
over intuitive ad hoc devices like the above principle of reduction. We agree with
Edwin Jaynes ([2], § 15.7, p. 469) that in order to avoid inconsistencies, the rules of inference
must be obeyed strictly, in every detail. Intuitive shortcuts that violate those rules
might, by a coincidence, lead to correct results in some very special cases, but will in
general lead to inconsistent and non-calibrated inferences.